1 Introduction

Physical laws are often described by differential equations. The original discovery of a differential equation associated with a real-world physical process typically requires a good understanding of the physical laws and supportive evidence from empirical observations. We consider the inverse problem: given experimental data, how can the underlying PDE be identified directly? We combine tools from machine learning and numerical PDEs to explore the given data and automatically identify the underlying dynamics.

Let \( \{u_{i}^n | i=1, \ldots , N_1 \text { and } n= 1, \ldots , N_2\}\) be the given discrete time-dependent data, where the indices i and n represent the discrete spatial and temporal locations, respectively. Our objective is to find the differential equation, i.e., an operator \(\mathcal {F}\):

$$\begin{aligned} u_t = \mathcal {F}(x,u,u_x,u_{xx}) \text { such that } u(x_i,t_n) \approx u_i^n. \end{aligned}$$

Recently there have been a number of important works on learning dynamical systems or differential equations. Two pioneering works can be found in [3, 29], where symbolic regression was used to recover the underlying physical systems from experimental data. In [6], Brunton et al. considered the discovery of nonlinear dynamical systems with sparsity-promoting techniques. The underlying dynamical systems are assumed to be governed by a small number of active terms in a prescribed dictionary, and sparse regression is used to identify these active terms. Various extensions of this sparse regression approach can be found in [13, 16, 20, 25]. In [26], Schaeffer considered the problem of learning PDEs using the spectral method, and focused on the benefit of using \(L^1\) minimization for sparse coefficient recovery. Highly corrupted and undersampled data are considered in [28, 31] for the recovery of dynamical systems. In [28], Schaeffer et al. developed a random sampling theory for the selection of dynamical systems from undersampled data. This series of works focused on the benefit and power of using \(L^1\) minimization to resolve dynamical systems or PDEs with certain sparsity patterns [27]. A Bayesian approach was considered in [35], where Zhang et al. used dimensional analysis and sparse Bayesian regression to recover the underlying dynamical systems. Another related problem is to infer the interaction function in a system of agents from the trajectory data. In [4, 18], nonparametric regression was used to predict the interaction function and a theoretical guarantee was established.

There are also approaches using deep learning techniques. In [17], Long et al. proposed a PDE-Net to learn differential operators by learning convolution kernels. In [24], Raissi et al. used neural networks to learn and predict the solution of the equation without finding its explicit form. In [23], neural networks were further used to learn certain parameters in the PDEs from the given data. In [21], Residual Neural Networks (ResNet) were used as building blocks for equation approximation. In [14], neural networks were used to solve wave-equation-based inverse scattering problems by providing maps between the scatterers and the scattered field (and vice versa). Related works showing the advantages of deep learning include [14, 19, 21, 22].

In this paper, we propose a new algorithm based on the convergence principle of numerical PDE schemes. We assume that the governing PDE is a linear combination of a few differential terms in a prescribed dictionary, and the objective is to find the correct set of coefficients. We use finite difference methods, such as the 5-point ENO scheme, to approximate the spatial derivatives in the dictionary. While we utilize \(L^1\) minimization to aid the efficiency of the approach, the main idea is to validate and correct the results by the Time Evolution Error (TEE). This approach, which we call Identifying Differential Equations with Numerical Time evolution (IDENT), is explored for data with non-periodic boundary conditions, noisy data, and nonlinear PDEs with varying coefficients. For noisy data, we propose an order preserving denoising method called Least Square Moving Average (LSMA) to effectively denoise the given data. To tackle varying coefficients, we expand the coefficients in terms of finite element bases. This procedure, called Base Element Expansion (BEE), again uses the fundamental idea of convergence in finite element approximation. From a theoretical perspective, we establish a performance guarantee based on an incoherence property, and define a new noise-to-signal ratio for the PDE identification problem. Contributions of this paper include:

  1. 1.

    establishing a new direction of using numerical PDE techniques for PDE identification,

  2. 2.

    proposing a flexible approach which can handle different boundary conditions, is robust against noise, and can identify nonlinear PDEs with varying coefficients,

  3. 3.

    establishing a recovery theory of Lasso, which leads to a new definition of the noise-to-signal ratio for PDE identification,

  4. 4.

    systematically analyzing the effects of noise and downsampling, and proposing a new denoising method called Least Square Moving Average (LSMA).

This paper is organized as follows: The main algorithm is presented in Sect. 2. Specifically, the set-up of the problem is presented in Sect. 2.2; details of the IDENT algorithm are in Sect. 2.3; a recovery theory for Lasso and the new noise-to-signal ratio are given in Sect. 2.4; and the first set of numerical experiments is in Sect. 2.5. The aspects of denoising and downsampling effects are presented in Sect. 3: the LSMA denoising method is introduced in Sect. 3.1, numerical experiments for noisy data are given in Sect. 3.2, and downsampling effects are considered in Sect. 3.3. The identification of PDEs with varying coefficients is presented in Sect. 4, where we consider nonlinear PDEs with varying coefficients and introduce BEE, motivated by finite element approximation. A concluding remark is given in Sect. 5 and some details are in the “Appendix”.

2 Identifying Differential Equations with Numerical Time Evolution (IDENT)

2.1 Notations

We use bold letters to denote vectors, such as \(\mathbf {a},\mathbf {b}\). The support of a vector \(\mathbf {x}\) is the set of indices at which the corresponding entry is nonzero: \(\mathrm{supp}(\mathbf {x}) := \{j : x_j \ne 0\}\). We use \(A^T\) and \(A^*\) to denote the transpose and the conjugate transpose of the matrix A. We use \(x \rightarrow \varepsilon ^+\) to denote \(x > \varepsilon \) and \(x\rightarrow \varepsilon \). Let \(\mathbf {f} = \{f(x_i,t_n) | i=1,\ldots ,N_1, n=1,\ldots ,N_2\} \in {\mathbb {R}}^{N_1N_2}\) be samples of a function \(f: \mathcal {D}\times [0,\infty ) \rightarrow {\mathbb {R}}\) with spatial spacing \(\varDelta x\) and time spacing \(\varDelta t\). The integers \(N_1\) and \(N_2\) are the total numbers of spatial and temporal grid points, respectively. We assume PDEs are simulated on a grid with time spacing \(\delta t\) and spatial spacing \(\delta x\), while data are sampled on a grid with time spacing \(\varDelta t\) and spatial spacing \(\varDelta x\). The vector \(L^p\) norm of \(\mathbf {f}\) is \(\Vert \mathbf {f}\Vert _p = (\sum _{i=1}^{N_1}\sum _{n=1}^{N_2}|f(x_i,t_n)|^p )^{1/p} \). Denote \(\Vert \mathbf {f}\Vert = \Vert \mathbf {f}\Vert _2\). The function \(L^p\) norm of \(\mathbf {f}\) is \(\Vert \mathbf {f}\Vert _{L^p} = (\sum _{i=1}^{N_1}\sum _{n=1}^{N_2}|f(x_i,t_n)|^p \varDelta x \varDelta t )^{1/p} \). Notice that \(\Vert \mathbf {f}\Vert _{L^p} = \Vert \mathbf {f}\varDelta x^{1/p} \varDelta t^{1/p}\Vert _p\).

2.2 The Set-Up of the Problem

We consider parametric PDEs where \(\mathcal {F}(x,u,u_x,u_{xx})\) is a linear combination of monomials such as 1, u, \(u^2\), \(u_x\), \(u_x^2\), \(uu_x\), \(u_{xx}\), \(u_{xx}^2\), \(uu_{xx}\), \(u_x u_{xx}\) with coefficients \(\mathbf {a}= \{a_j\}_{j=1}^{10}\):

$$\begin{aligned} u_t = a_1 + a_2 u + a_3 u^2 + a_4 u_x + a_5 u_x^2+ a_6 u u_x + a_7 u_{xx} + a_8 u^2_{xx} + a_9 u u_{xx} + a_{10} u_x u_{xx}.\nonumber \\ \end{aligned}$$
(1)

We refer to each monomial as a feature, and let \(N_3\) be the number of features, i.e., \(N_3 = 10\) in (1). The right hand side can be viewed as a second-order Taylor expansion of \(\mathcal {F}(u,u_x,u_{xx})\). It can easily be generalized to higher-order Taylor expansions, and to operators \(\mathcal {F}(u,u_x,u_{xx},u_{xxx}, \partial _x^4 u,\ldots )\) depending on higher order derivatives. This model contains a rich class of differential equations, e.g., the heat equation, the transport equation, Burgers’ equation, the KdV equation, and Fisher’s equation that models gene propagation.

Evaluating (1) at discrete time and space \((x_i,t_n), i=1,\ldots ,N_1, n = 1,\ldots ,N_2\) yields the discrete linear system

$$\begin{aligned} F \mathbf{a } = \mathbf {b}, \end{aligned}$$

where

$$\begin{aligned} \mathbf {b} = \{u_t(x_i,t_n)| i=1,\ldots ,N_1, n = 1,\ldots ,N_2\} \in {\mathbb {R}}^{N_1N_2}, \end{aligned}$$

and F is an \(N_1N_2 \times N_3 \) feature matrix of the form

$$\begin{aligned} F = \left( \begin{array}{cccccccccc} \vdots &{} \vdots &{} \vdots &{} \vdots &{}\vdots &{} \vdots &{}\vdots &{} \vdots &{}\vdots &{} \vdots \\ 1 &{} u(x_i,t_n) &{} u^2(x_i,t_n) &{} u_x(x_i,t_n) &{} u_x^2(x_i,t_n) &{} uu_x(x_i,t_n) &{} u_{xx}(x_i,t_n) &{} u_{xx}^2(x_i,t_n) &{} u u_{xx}(x_i,t_n) &{} u_x u_{xx}(x_i,t_n) \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{}\vdots &{} \vdots &{}\vdots &{} \vdots &{}\vdots &{} \vdots \\ \end{array} \right) .\nonumber \\ \end{aligned}$$
(2)

We use F[j] to denote the jth column of F, associated with the jth feature evaluated at \((x_i,t_n),\) \(i=1,\ldots ,N_1,\) \(n=1,\ldots ,N_2\).

The objective of PDE identification is to recover the unknown coefficient vector \(\mathbf {a} \in {\mathbb {R}}^{N_3}\) from the given data. Real-world physical processes are often governed by only a few features on the right hand side of (1), so it is reasonable to assume that the coefficient vector is sparse.
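For illustration, a minimal numpy sketch of assembling the feature matrix (2) is given below; the arrays of u, \(u_x\) and \(u_{xx}\) on the grid are assumed to be available (exact values here, or the finite-difference approximations described in Sect. 2.3), and the function name is ours rather than part of any library.

```python
import numpy as np

def build_feature_matrix(u, ux, uxx):
    """Assemble the N1*N2-by-10 feature matrix F of (2).

    u, ux, uxx: arrays of u, u_x, u_xx evaluated at the grid points (x_i, t_n),
    all with the same shape; each column of F is one monomial feature of (1).
    """
    cols = [np.ones_like(u), u, u**2, ux, ux**2, u * ux,
            uxx, uxx**2, u * uxx, ux * uxx]
    return np.column_stack([np.ravel(c) for c in cols])
```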

For differential equations with varying coefficients, we consider PDEs of the form

$$\begin{aligned} u_t= & {} a_1(x) + a_2 (x) u + a_3 (x) u^2 + a_4(x) u_x + a_5(x) u_x^2+ a_6(x) u u_x \nonumber \\&\quad + a_7(x) u_{xx} + a_8 (x) u^2_{xx} + a_9 (x) u u_{xx} + a_{10}(x) u_x u_{xx} \end{aligned}$$
(3)

where each \(a_j(x)\) is a function on the spatial domain of the PDE. We expand the coefficients in terms of finite element bases \(\{ \phi _l\}_{l=1}^{L} \) such that

$$\begin{aligned} a_j (x) \approx \sum _{l=1}^{L} a_{j,l} \phi _l(x) \text { for } j=1,\ldots ,N_3, \end{aligned}$$
(4)

where L is the number of finite element bases used to approximate \(a_j(x)\). Let \(y_1<y_2<\cdots <y_L\) be a partition of the spatial domain. We use a typical finite element basis: \(\phi _l(x)\) is continuous, linear within each subinterval \((y_i, y_{i+1})\), and satisfies \(\phi _l(y_i)=\delta _{li}\), i.e., 1 if \(i=l\) and 0 otherwise. If the \(a_j(x)\)’s are Lipschitz functions and the finite element bases are defined on a grid with spacing O(1/L), then the approximation error of the \(a_j(x)\)’s satisfies

$$\begin{aligned} \Vert a_j - \sum _{l=1}^{L} a_{j,l} \phi _l\Vert _{L^p} \le O(1/L), \ p \in (0,\infty ). \end{aligned}$$
(5)
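As an illustration of the expansion (4), the sketch below evaluates one common choice of piecewise-linear hat functions on the nodes \(y_1<\cdots <y_L\); it is a minimal stand-in for a standard linear finite element basis, with a helper name of our own.

```python
import numpy as np

def hat_basis(x, y):
    """Evaluate L piecewise-linear hat functions phi_l at the points x.

    y: increasing array of L nodes with phi_l(y_i) = 1 if i = l and 0 otherwise.
    Returns an array of shape (len(x), L).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    L = len(y)
    phi = np.zeros((len(x), L))
    for l in range(L):
        up = np.ones_like(x) if l == 0 else (x - y[l - 1]) / (y[l] - y[l - 1])
        down = np.ones_like(x) if l == L - 1 else (y[l + 1] - x) / (y[l + 1] - y[l])
        phi[:, l] = np.clip(np.minimum(up, down), 0.0, 1.0)
    return phi

# a_j(x) is then approximated as in (4) by phi @ a_j_coeffs, with a_j_coeffs in R^L.
```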

In the case of varying coefficients, the feature matrix F is of size \(N_1 N_2 \times N_3 L\),

$$\begin{aligned} F = \left( \begin{array}{ccc|ccc|c|ccc} \vdots &{} &{} \vdots &{} \vdots &{} &{} \vdots &{} &{} \vdots &{} &{} \vdots \\ \phi _1(x_i) &{} \dots &{} \phi _{L}(x_i) &{} u(x_i,t_n)\phi _1(x_i) &{} \dots &{} u(x_i,t_n)\phi _{L}(x_i) &{} \dots &{} u_xu_{xx}(x_i,t_n)\phi _1(x_i) &{} \dots &{}u_x u_{xx}(x_i,t_n)\phi _{L}(x_i) \\ \vdots &{} &{} \vdots &{} \vdots &{} &{} \vdots &{} &{} \vdots &{} &{} \vdots \\ \end{array} \right) , \end{aligned}$$
(6)

and the vector to be identified is

$$\begin{aligned} \mathbf{a } = \left( a_{1,1},\ldots ,a_{1,L} | a_{2,1}, \ldots , a_{2,L} | \ldots | a_{N_3,1}, \ldots , a_{N_3,L} \right) ^T \in \mathrm {R}^{N_3 L}. \end{aligned}$$

The feature matrix F has a block structure. We use F[jl] to denote the column of F associated with the jth feature and the lth basis. To be clear, F[j] is the jth column of (2), and F[jl] is the \((j-1)L+l\)th column of (6). Evaluating (3) at \((x_i,t_n), i=1,\ldots ,N_1, n = 1,\ldots ,N_2\) yields the discrete linear system

$$\begin{aligned} F \mathbf {a}= \mathbf {b}+ \varvec{\eta }, \end{aligned}$$

where \(\varvec{\eta }= \{\eta (x_i,t_n) | i=1,\ldots ,N_1, n=1,\ldots ,N_2\} \in {\mathbb {R}}^{N_1 N_2}\) represents the approximation error of the \(a_j(x)\)’s by finite element bases such that

$$\begin{aligned} \eta (x_i,t_n) = \left( \sum _{l=1}^L a_{1,l}\phi _l(x_i) - a_1(x_i)\right) + \cdots + \left( \sum _{l=1}^L a_{10,l}\phi _l(x_i) - a_{10}(x_i)\right) u_xu_{xx} (x_i,t_n). \end{aligned}$$

In the case that \(u,u_x,u_{xx}\) are uniformly bounded,

$$\begin{aligned} \Vert \varvec{\eta }\Vert _{L^p} \le O(1/L), \ p \in (0,\infty ), \end{aligned}$$

and \(\varvec{\eta }= 0\) when all coefficients are constants.

2.3 The Proposed Algorithm: IDENT

In this paper, we assume that only the discrete data \(\{u_{i}^n | i=1, \ldots , N_1\text { and } n= 1, \ldots , N_2\}\) and the boundary conditions are given. If data are perfectly generated and there is no measurement noise, then \(u_{i}^n = u(x_i,t_n)\) for every i and n. We outline the proposed IDENT algorithm in this section assuming the given data are noise-free.

The first step of IDENT is to construct the empirical version of the feature matrix F and of the vector \(\mathbf {b}\) containing the time derivatives from the given data. The derivatives are approximated by finite difference methods, which gives flexibility in dealing with different types of PDEs and boundary conditions (e.g., non-periodic). We approximate the time derivative \(u_t\) by a first-order backward difference scheme:

$$\begin{aligned} u_t (x_j,t_n) \approx \widehat{u_t} (x_j,t_n) := \frac{u(x_j,t_n) - u(x_j,t_{n-1})}{\varDelta t}, \end{aligned}$$

which yields the error

$$\begin{aligned} \widehat{u_t} (x_j,t_n) = {u_t} (x_j,t_n) + O(\varDelta t). \end{aligned}$$

Let \({\widehat{\mathbf {b}}}\) be the empirical version of \(\mathbf {b}\) constructed from data:

$$\begin{aligned} \widehat{\mathbf {b}} = \{\widehat{u_t}(x_i,t_n): i=1,\ldots ,N_1, n = 1,\ldots ,N_2\} \in {\mathbb {R}}^{N_1N_2}. \end{aligned}$$

We approximate the spatial derivative \(u_x\) through the five-point ENO method proposed by Harten et al. [11]. Let \(\widehat{u_x} (x_j,t_n)\) and \(\widehat{u_{xx}} (x_j,t_n)\) be approximations of \({u_x} (x_j,t_n)\) and \({u_{xx}} (x_j,t_n)\) by the five-point ENO method which yields the error:

$$\begin{aligned} \widehat{u_x} (x_j,t_n) = {u_x} (x_j,t_n) + O(\varDelta x^4), \quad \widehat{u_{xx}} (x_j,t_n) = {u_{xx}} (x_j,t_n) + O(\varDelta x^3). \end{aligned}$$

Putting the \(\widehat{u_x} (x_j,t_n)\)’s and \(\widehat{u_{xx}} (x_j,t_n)\)’s into the feature matrix F in (6) gives rise to the empirical feature matrix, denoted by \(\widehat{F}\). For example, the second column of \(\widehat{F}\) is given by \(\{u_i^n | i=1,\ldots ,N_1 \text { and } n=1,\ldots ,N_2\}\), an approximation of \(\{u(x_i,t_n) | i=1,\ldots ,N_1 \text { and } n=1,\ldots ,N_2\}\), ordered as follows

$$\begin{aligned} (u_1^1, u_2^1, \ldots ,u_{N_1}^1, u_1^2, \ldots , u_{N_1}^2, \ldots , u_1^{N_2}, \ldots ,u_{N_1}^{N_2})^T \in \mathrm {R}^{N_1 N_2}. \end{aligned}$$

These empirical quantities give rise to the linear system

$$\begin{aligned} {\widehat{F}} \mathbf {a}= {\widehat{\mathbf {b}}} + \mathbf {e}, \quad \mathbf {e} = \mathbf {b}-\widehat{\mathbf {b}} + (\widehat{F}-F)\mathbf {a}+\varvec{\eta }, \end{aligned}$$
(7)

where the terms \(\mathbf {b}-\widehat{\mathbf {b}}\), \((\widehat{F}-F)\mathbf {a}\) and \(\varvec{\eta }\) arise from errors in approximating time and spatial derivatives, and the finite element expansion of varying coefficients, respectively. The total error \(\mathbf {e}\) satisfies

$$\begin{aligned} \Vert \mathbf {e}\Vert _{L^2} \le \varepsilon \text { such that } \varepsilon = O(\varDelta t + \varDelta x^3 + 1/L). \end{aligned}$$
(8)
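A minimal sketch of this first step is given below. The backward difference in time follows the text; for the spatial derivatives we use standard centered differences as a simple stand-in for the five-point ENO reconstruction, and leave the boundary points untreated. The function name and these simplifications are our own choices.

```python
import numpy as np

def empirical_derivatives(U, dx, dt):
    """Approximate u_t, u_x, u_xx from data U[i, n] = u(x_i, t_n).

    Backward differences in time (O(dt)); centered differences in space
    (a stand-in for the five-point ENO approximation). Boundary rows are
    left at zero and should be handled according to the boundary condition.
    """
    ut = (U[:, 1:] - U[:, :-1]) / dt                    # first-order backward difference
    ux, uxx = np.zeros_like(U), np.zeros_like(U)
    ux[1:-1, :] = (U[2:, :] - U[:-2, :]) / (2 * dx)
    uxx[1:-1, :] = (U[2:, :] - 2 * U[1:-1, :] + U[:-2, :]) / dx**2
    # drop the first time level so that all arrays align with u_t
    return U[:, 1:], ux[:, 1:], uxx[:, 1:], ut
```

Feeding the returned arrays of u, \(u_x\), \(u_{xx}\) into the dictionary builder sketched in Sect. 2.2 gives \(\widehat{F}\), and flattening the array of \(\widehat{u_t}\) values gives \(\widehat{\mathbf {b}}\).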

The second step is to find possible candidates for the non-zero coefficients of \( \mathbf {a}\). We utilize \(L^1\)-regularized minimization, also known as Lasso [30] or group Lasso [34], solved by the Alternating Direction Method of Multipliers [2, 5], to obtain a sparse or block-sparse vector. We minimize the following energy:

$$\begin{aligned} \widehat{\mathbf {a}}_{\text {G-Lasso}}(\lambda ) ={\textstyle \arg \min _{\mathbf {z}}} \left\{ \frac{1}{2} \Vert {\widehat{\mathbf {b}}} - {\widehat{F}}_{\infty } \mathbf{z } \Vert ^2_2 + \lambda \sum _{j=1}^{N_3} \left( \sum _{l=1}^L |z_{j,l}|^2 \right) ^{\frac{1}{2}} \right\} , \end{aligned}$$
(9)

where \(\lambda \) is a balancing parameter between the first fitting term and the second regularization term. The matrix \(\widehat{F}_\infty \) is obtained from \(\widehat{F}\) with each column divided by the maximum magnitude of the column, namely, \(\widehat{F}_\infty [j,l] = \widehat{F}[j,l]/\Vert \widehat{F}[j,l]\Vert _{\infty }\). We use Lasso for the constant coefficient case where \(L=1\), and group Lasso for the varying coefficient case \(L>1\). A set of possible active features is selected by thresholding the normalized coefficient magnitudes:

$$\begin{aligned} {\widehat{\varLambda }}_{\tau } : = \left\{ j : \Vert \widehat{F}[j]\Vert _{L^1}\left\| \sum _{l=1}^L \frac{\widehat{\mathbf {a}}_{\text {G-Lasso}}(\lambda )_{j,l}}{\Vert \widehat{F}[j,l]\Vert _\infty } \phi _l \right\| _{L^1} \ge \tau \right\} . \end{aligned}$$
(10)

with a fixed thresholding parameter \(\tau \ge 0\).
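The paper solves (9) by ADMM; as a self-contained stand-in, the sketch below minimizes the same energy by proximal gradient descent (ISTA), which reduces to the standard Lasso when every group has size one. The solver name and the plain ISTA iteration are our choices, not the implementation used for the experiments.

```python
import numpy as np

def group_lasso_ista(F, b, groups, lam, n_iter=5000):
    """Minimize 0.5 * ||b - F z||_2^2 + lam * sum_g ||z_g||_2, cf. (9).

    groups: list of index arrays partitioning the columns of F;
    singleton groups give the standard Lasso used when L = 1.
    """
    z = np.zeros(F.shape[1])
    step = 1.0 / np.linalg.norm(F, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        w = z - step * F.T @ (F @ z - b)          # gradient step on the fitting term
        for g in groups:                          # block soft-thresholding (prox of the penalty)
            ng = np.linalg.norm(w[g])
            if ng > 0:
                w[g] = max(0.0, 1.0 - step * lam / ng) * w[g]
        z = w
    return z
```

In practice each column is first divided by its sup-norm, as in the definition of \(\widehat{F}_\infty \), and the thresholding (10) is applied to the resulting block magnitudes.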

The final step is to identify the correct support using the Time Evolution Error (TEE). (i) From the candidate coefficient index set \({\widehat{\varLambda }}_{\tau }\), consider every subset \(\varOmega \subseteq {\widehat{\varLambda }}_{\tau }\). For each \(\varOmega = \{j_1, j_2, \ldots , j_k\}\), find the coefficients \(\widehat{\mathbf{a }} = (0, 0, \widehat{a}_{j_1}, 0, \ldots , \widehat{a}_{j_k}, \ldots )\) by a least-squares fit such that \(\widehat{\mathbf {a}}_{\varOmega } = \widehat{F}_{\varOmega }^\dagger {\widehat{\mathbf {b}}}\) and \(\widehat{\mathbf {a}}_{\varOmega ^\complement } = \mathbf {0}\). (ii) Using these coefficients, we construct the differential equation and numerically time evolve

$$\begin{aligned} u_t = \mathcal {F} \widehat{\mathbf{a }}, \end{aligned}$$

starting from the given initial data, for each \(\varOmega \). It is crucial to use a smaller time step \(\widetilde{\varDelta t}\ll \varDelta t\), where \(\varDelta t\) is the time spacing of the given data. TEE follows the principle of numerical convergence: the numerical solution converges to the true solution of the underlying PDE as the time step goes to zero. We use the first-order Euler scheme for time evolution with time step \(\widetilde{\varDelta t}=O(\varDelta x^r)\), where r is the highest order of the spatial derivatives in the PDE associated with \(\widehat{\mathbf {a}}\). When the candidate PDE contains high-order spatial derivatives, this small time step is necessary to stabilize the explicit scheme used for the TEE evolution. (iii) Finally, calculate the time evolution error for each \(\widehat{\mathbf{a }}\):

$$\begin{aligned} \text { TEE} (\widehat{\mathbf{a }}) := \sum _{i=1}^{N_1} \sum _{n=1}^{N_2} |\bar{u}_i^n - u_i^n| \varDelta x \varDelta t, \end{aligned}$$

where \(\bar{u}_i^n\) is the numerically time evolved solution at \((x_i,t_n)\) of the PDE with the coefficient \(\widehat{\mathbf{a }}\). We pick the subset \(\varOmega \) and the corresponding coefficients \(\widehat{\mathbf {a}}\), which give the smallest TEE, and denote the recovered support as \(\widehat{\varLambda }\). This is the output of the algorithm, which is the identified PDE. Algorithm 1 summarizes this procedure.

[Algorithm 1: IDENT]
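A minimal sketch of Step 3 is given below. The forward solver evolve is a placeholder for a user-supplied explicit scheme (first-order Euler in time with the reduced step \(\widetilde{\varDelta t}\)) and is not spelled out here; the function and argument names are our own.

```python
import numpy as np
from itertools import combinations

def tee_select(F_hat, b_hat, candidates, evolve, U, dx, dt):
    """Step 3 of IDENT: for each subset of the Lasso candidates, fit the
    coefficients by least squares, time-evolve the candidate PDE from the
    given initial data, and keep the subset with the smallest TEE.

    evolve(support, coeffs) is a placeholder returning the numerically
    evolved solution sampled back on the data grid (same shape as U).
    """
    best_support, best_coeffs, best_tee = None, None, np.inf
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            idx = list(subset)
            coeffs, *_ = np.linalg.lstsq(F_hat[:, idx], b_hat, rcond=None)
            U_bar = evolve(idx, coeffs)                   # explicit time evolution
            tee = np.sum(np.abs(U_bar - U)) * dx * dt     # the TEE formula
            if tee < best_tee:
                best_support, best_coeffs, best_tee = idx, coeffs, tee
    return best_support, best_coeffs, best_tee
```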

We note that it is possible to skip the \(L^1\) minimization step and use TEE to recover the support of the coefficients by considering all possible combinations from the beginning; however, the computational cost is very high. The \(L^1\) minimization helps to reduce the number of combinatorial trials and makes IDENT more computationally efficient. On the other hand, while \(L^1\) minimization is effective in finding a sparse vector, \(L^1\) alone is often not enough: (i) Zero coefficients in the true PDE may become non-zero in the minimizer of \(L^1\). (ii) If active terms are chosen by thresholding, results are sensitive to the choice of the thresholding parameter, e.g., \(\tau \) in (10). (iii) The balancing parameter \(\lambda \) can affect the results. (iv) If some columns of the empirical feature matrix \(\widehat{F}\) are highly correlated, Lasso is known to produce a larger support than the ground truth [9]. TEE refines the results from Lasso and relaxes the dependence on the parameters.

There are two fundamental ideas behind TEE:

  1. 1.

    For nonlinear PDEs, it is impossible to isolate each term separately to identify each coefficient. Any realization of the PDE must be understood as a set of terms.

  2. 2.

    If the underlying dynamics are identified by the true PDE, any refinement in the discretization of the time domain should not deviate from the given data. This is the fundamental principle of consistency, convergence and stability of a numerical scheme, and the reason for choosing a smaller time step \(\widetilde{\varDelta t}\ll \varDelta t\).

Therefore, the main effect of TEE is to evolve the numerical error from the wrongly identified differential terms. This method can be applied to linear or nonlinear PDEs. The effectiveness of TEE can be demonstrated with an example. Assume that the solution u is smooth and decays sufficiently fast at infinity, and consider the following linear equation with constant coefficients:

$$\begin{aligned} \frac{\partial u}{\partial t} = a_0 u + a_1 \frac{\partial u}{\partial x} +\cdots + a_m \frac{\partial ^m u}{\partial x^m}. \end{aligned}$$

After taking the Fourier transform for the equation and solving the ODE, one can obtain the transformed solution:

$$\begin{aligned} {\hat{u}}(\xi ,t)={\hat{u}}(\xi ,0) e^{a_0 t} e^{a_1\mathbf {i} \xi t} e^{-a_2 \xi ^2 t }\cdots e^{a_m (\mathbf {i}\xi )^m t}, \end{aligned}$$

where \(\mathbf {i}=\sqrt{-1}\) and \(\xi \) is the variable in the Fourier domain. If a term with an even-order derivative, such as \(a_2 \frac{\partial ^2 u}{\partial x^2}\), is mistakenly included in the PDE, it will make every frequency mode grow or decrease exponentially in time; if a term with an odd-order derivative, such as \(a_1 \frac{\partial u}{\partial x}\), is mistakenly included in the PDE, it will introduce a wrong-speed oscillation of the solution. In either case, the deviation from the correct solution grows fast in time, providing an efficient way to distinguish the wrong terms. Our numerical experiments show that TEE is an effective tool to correctly identify the coefficients. Our first set of experiments is presented in Sect. 2.5.

2.4 Recovery Theory of Lasso, and New Noise-to-Signal Ratio (NSR)

In this subsection, we establish a performance guarantee of Lasso for the identification of PDEs with constant coefficients. In Step 2 of IDENT, Lasso is applied as the \(L^1\) regularization in (9). We consider the incoherence property proposed in [8], and follow the ideas in [10, 32, 33] to establish a recovery theory. While the details of the proof are presented in Appendix 1, here we state the result, which leads to a new definition of the noise-to-signal ratio.

For PDEs with constant coefficients, we set \(L=1\) in (4), and consider the standard Lasso:

$$\begin{aligned} \widehat{\mathbf {a}}_{\text {Lasso}}(\lambda ) ={\textstyle \arg \min _{\mathbf {z}}} \left\{ \frac{1}{2} \Vert {\widehat{\mathbf {b}}} - {\widehat{F}}_\infty \mathbf{z} \Vert ^2_2 + \lambda \Vert \mathbf {z}\Vert _1 \right\} . \end{aligned}$$
(Lasso)

If all columns of \(\widehat{F}\) are uncorrelated, \(\mathbf {a}\) can be robustly recovered by Lasso. Let \(\widehat{F}= [\widehat{F}[1] \ \widehat{F}[2] \ \ldots \ \widehat{F}[N_3]]\) where \(\widehat{F}[j]\) stands for the jth column of \(\widehat{F}\) in (2). To measure the correlation between the jth and the lth column of \(\widehat{F}\), we use the pairwise coherence

$$\begin{aligned} \mu _{j,l}(\widehat{F}) = \frac{|\langle \widehat{F}[j] , \widehat{F}[l] \rangle |}{\Vert \widehat{F}[j]\Vert _2 \Vert \widehat{F}[l]\Vert _2} \end{aligned}$$

and the mutual coherence of \(\widehat{F}\) as in [8]:

$$\begin{aligned} \mu (\widehat{F}) = \max _{j \ne l} \mu _{j,l}(\widehat{F}) = \max _{j\ne l} \frac{|\langle \widehat{F}[j], \widehat{F}[l] \rangle |}{\Vert \widehat{F}[j]\Vert _2 \Vert \widehat{F}[l]\Vert _2}. \end{aligned}$$

Since normalization does not affect the coherence, we have \( \mu _{j,l}(\widehat{F}_\infty )= \mu _{j,l}(\widehat{F})\) and \(\mu (\widehat{F}_\infty )=\mu (\widehat{F})\). The smaller \(\mu (\widehat{F})\), the less correlated are the columns of \(\widehat{F}\), and \(\mu (\widehat{F}) = 0\) if and only if the columns are orthogonal. Lasso will recover the correct coefficients if \(\mu (\widehat{F})\) is sufficiently small.
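For reference, the mutual coherence and the coherence pattern shown later (e.g. Fig. 1b) can be computed with a few lines of numpy; the sketch below is a direct transcription of the definitions above, with a function name of our own.

```python
import numpy as np

def mutual_coherence(F):
    """Mutual coherence mu(F): the largest pairwise coherence between distinct columns."""
    G = F / np.linalg.norm(F, axis=0)     # columns normalized to unit l2 norm
    C = np.abs(G.T @ G)                   # pairwise coherences mu_{j,l}; Fig. 1b plots this matrix
    np.fill_diagonal(C, 0.0)
    return C.max()
```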

Theorem 1

Let \(\mu =\mu (\widehat{F})\), \(w_{\max } = \max _j \Vert \widehat{F}[j]\Vert _\infty \Vert \widehat{F}[j]\Vert _{L^2}^{-1}\) and \(w_{\min } = \min _j \Vert \widehat{F}[j]\Vert _\infty \Vert \widehat{F}[j]\Vert _{L^2}^{-1}\). Suppose the support of \(\mathbf {a}\) contains no more than s indices, \(\mu (s-1) < 1\) and

$$\begin{aligned} \frac{\mu s}{1-\mu (s-1)} < \frac{w_{\min }}{w_{\max }}. \end{aligned}$$

Let

$$\begin{aligned} \lambda = \frac{[1-(s-1)\mu ] }{w_{\min }[1-\mu (s-1)] - w_{\max } \mu s }\cdot \frac{\varepsilon ^+}{\varDelta x \varDelta t}. \end{aligned}$$
(11)

Then

  1. 1)

    the support of \(\widehat{\mathbf {a}}_{\text {Lasso}}(\lambda )\) is contained in the support of \(\mathbf {a}\);

  2. 2)

    the distance between \(\widehat{\mathbf {a}}_{\text {Lasso}}(\lambda )\) and \(\mathbf {a}\) satisfies

    $$\begin{aligned} \max _j \Vert \widehat{F}[j]\Vert _{L^2} \left| \Vert \widehat{F}[j]\Vert _\infty ^{-1}\widehat{\mathbf {a}}_{\text {Lasso}}(\lambda )_j-a_j\right| \le \frac{w_{\max } + \varepsilon /\sqrt{\varDelta t \varDelta x}}{w_{\min }[1-\mu (s-1)] -w_{\max } \mu s } \varepsilon ; \end{aligned}$$
    (12)
  3. 3)

    if

    $$\begin{aligned} \min _{j:\ a_j\ne 0} \Vert \widehat{F}[j]\Vert _{L^2} | a_j| > \frac{w_{\max }+\varepsilon /\sqrt{\varDelta t \varDelta x}}{w_{\min }[1-\mu (s-1)] -w_{\max } \mu s } \varepsilon , \end{aligned}$$
    (13)

    then the support of \(\widehat{\mathbf {a}}_{\text {Lasso}}(\lambda )\) is exactly the same as the support of \(\mathbf {a}\).

Theorem 1 shows that Lasso gives rise to the correct support when the empirical feature matrix \(\widehat{F}\) is incoherent, i.e. \(\mu (\widehat{F}) \ll 1\), and all underlying coefficients are sufficiently large compared to the noise. When the empirical feature matrix is coherent, i.e., some columns of \(\widehat{F}\) are correlated, it has been observed that \(\widehat{\mathbf {a}}_{\text {Lasso}}(\lambda )\) is usually supported on \(\mathrm{supp}(\mathbf {a})\) and on the indices that are highly correlated with \(\mathrm{supp}(\mathbf {a})\) [9]. We select possible features by thresholding in (10), which is equivalent to \({\widehat{\varLambda }}_{\tau } : = \left\{ j : \Vert \widehat{F}[j]\Vert _{L^1} \Vert \widehat{F}[j]\Vert _{\infty }^{-1} |\widehat{\mathbf {a}}_{\text {Lasso}}(\lambda )_j| \ge \tau \right\} \) in the case of constant coefficients. After this, TEE is an effective tool, complementing Lasso, to distinguish the correct features from the wrong ones. The details of Theorem 1 can be found in Appendix 1.

This analysis also gives rise to a new noise-to-signal ratio:

$$\begin{aligned} \text {Noise-to-Signal Ratio (NSR)} := \frac{\Vert \widehat{F}\mathbf {a}-{\widehat{\mathbf {b}}}\Vert _{L^2}}{\min _{j:\ a_j\ne 0} \Vert \widehat{F}[j]\Vert _{L^2} | a_j| }. \end{aligned}$$
(14)

The definition is derived from (13), which shows that the signal level is given by the minimum of the product of a coefficient and the norm of the corresponding column of the feature matrix, \(\min _{j:\ a_j\ne 0} \Vert \widehat{F}[j]\Vert _{L^2} | a_j|\). This term represents the dynamics resulting from the feature. It is important to consider this product rather than the magnitude of the coefficient alone. We also use this new definition of NSR to measure the level of noise in the following sections, which gives a more consistent representation.
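The new NSR is straightforward to compute from the empirical quantities. A short sketch, using the relation \(\Vert \mathbf {f}\Vert _{L^2} = \sqrt{\varDelta x \varDelta t}\,\Vert \mathbf {f}\Vert _2\) from Sect. 2.1 (the function name is ours):

```python
import numpy as np

def noise_to_signal_ratio(F_hat, b_hat, a, dx, dt):
    """New NSR of (14); the grid weight sqrt(dx*dt) converts l2 norms to L2 norms."""
    w = np.sqrt(dx * dt)
    residual = w * np.linalg.norm(F_hat @ a - b_hat)
    active = np.flatnonzero(a)
    signal = np.min(w * np.linalg.norm(F_hat[:, active], axis=0) * np.abs(a[active]))
    return residual / signal
```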

2.5 First Set of IDENT Experiments

We present the first set of numerical experiments to illustrate the effectiveness of IDENT. Here data are sampled from exact or simulated solutions of PDEs with constant coefficients. For boundary conditions, we use zero Dirichlet boundary conditions throughout the paper. Modification to periodic or other boundary conditions is trivial, and numerical schemes with periodic boundary conditions can achieve higher accuracy for the cases without noise. We observe that, thanks to TEE, the results are not very sensitive to the choice of \(\lambda \) in Lasso, and we set \(\lambda =500\) in all experiments.

Fig. 1

Experiment with the Burgers’ equation (15). a The given data are sampled from the true analytic solution. b The coherence pattern of \(\widehat{F}\). c Normalized coefficient magnitudes from Lasso. Two possible features are identified, which are u and \(uu_x\)

The first experiment is on the Burgers’ equation with Dirichlet boundary conditions:

$$\begin{aligned}&u_t + \left( \frac{u^2}{2}\right) _x =0, \ x \in [0,1] \nonumber \\&u(x,0)= \sin 4\pi x \text { and }u(0,t) = u(1,t) = 0. \end{aligned}$$
(15)

The given data are sampled from the true analytic solution, shown in Fig. 1a, with \(\varDelta x = 1/56\) and \(\varDelta t = 0.004\), for \(t \in [0,0.05]\). Figure 1b displays the coherence pattern of the empirical feature matrix: the absolute values of \(\widehat{F}_{\mathrm{unit}}^* \widehat{F}_{\mathrm{unit}}\) where \(\widehat{F}_{\mathrm{unit}}\) is obtained from \(\widehat{F}\) with columns normalized to unit \(L^2\) norm. This pattern shows the correlation between any pair of columns in \(\widehat{F}\). (c) shows the normalized coefficient magnitudes \(\{\Vert \widehat{F}[j]\Vert _{L^1} \Vert \widehat{F}[j]\Vert _{\infty }^{-1} |\widehat{\mathbf {a}}_{\text {Lasso}}(\lambda )_j|\}\) after \(L^1\) minimization. The magnitudes of u and \(uu_x\) are not negligible, so they are picked as a possible set of active features in \({\widehat{\varLambda }}_{\tau }\). Then, TEE is computed for all subsets \(\varOmega \subseteq {\widehat{\varLambda }}_{\tau }\), i.e., \(u_t = a u\), \(u_t = b uu_x\) and \(u_t = cu +d uu_x\), where the coefficients a, b, c, d are calculated by least squares:

[TEE comparison table]

The red line, with only the \(uu_x\) term, has the smallest TEE, and is therefore identified as the result of IDENT. Since the true PDE is \(u_t = - u u_x\), the computed result has a small coefficient error.

The second experiment is on the Burgers’ equation with a diffusion term:

$$\begin{aligned}&u_t + \left( \frac{u^2}{2}\right) _x =0.1u_{xx},\ x \in [0,1] \nonumber \\&u(x,0)= \sin 4\pi x \text { and }u(0,t) = u(1,t) = 0. \end{aligned}$$
(16)

The given data are simulated with a first-order explicit method where \(\delta x = 1/256\) and \(\delta t = (\delta x)^2\) for \(t \in [0,0.1]\). Data are downsampled from the numerical simulation by a factor of 4 such that \(\varDelta x = 4\delta x\) and \(\varDelta t = 4\delta t\). (We explore the effects of downsampling in more detail in Sect. 3.)

Fig. 2

Experiment with Burgers’ equation with a diffusion term (16). a The given data are numerically simulated and downsampled. b shows that u and \(u_x u_{xx}\) are highly correlated with \(u_{xx}\) and \(u u_x\), respectively. From (c), four terms \(u, uu_x, u_{xx} \) and \(u_x u_{xx}\) are selected for TEE

Figure 2a shows the given data, (b) displays the coherence pattern of \(\widehat{F}\), and (c) shows the normalized coefficient magnitudes \(\{\Vert \widehat{F}[j]\Vert _{L^1} \Vert \widehat{F}[j]\Vert _{\infty }^{-1} |\widehat{\mathbf {a}}_{\text {Lasso}}(\lambda )_j|\}\). In this case, the coherence pattern in (b) shows that u and \(u_x u_{xx}\) are highly correlated with \(u_{xx}\) and \(u u_x\), respectively, and therefore all four terms \(u,uu_x,u_{xx},u_x u_{xx}\) are identified as meaningful ones by Lasso in (c). Considering TEE for each subset refines these results:

[TEE comparison table]

The red line is the result of IDENT, while the blue line is the ground truth. The TEE of \([uu_x \ u_{xx} \ u_x u_{xx}]\) is the smallest, which is comparable with the TEE of the true equation with \([ uu_x \ u_{xx}]\). The one wrongly identified term in red, \(u_x u_{xx}\), has a coefficient of \( -1.35 \times 10^{-5}\), which is negligible. The level of error in the identification is also related to the total error explored in (17). Without TEE, if all four terms from the \(L^1\) minimization are used, an additional wrong term u is identified with the coefficient \(-0.11\). This is comparable to the other terms with coefficients such as \(-1\) or 0.1, and cannot be ignored.

Fig. 3

Identified coefficients from Lasso (Step 2 only) versus \(\log _2\varDelta x\). In (a), as \(\varDelta t\) and \(\varDelta x\) decrease (from right to left), the coefficient of \(u u_x\) correctly converges to \(-1\), and all other terms correctly converge to 0. In (b), as \(\varDelta t\) and \(\varDelta x\) decrease (from right to left), while the coefficients of \(u_{xx}\) and \(u u_x\) correctly converge to 0.1 and \(-1\) respectively, one wrong term u does not converge to 0, due to the error from data generations

Theorem 1 proves that the identified coefficients from Lasso will converge to the ground truth as \(\varDelta t \rightarrow 0\) and \(\varDelta x \rightarrow 0\) (see Eq. (8) and (12)), when there is no noise and the empirical feature matrix has a small coherence. Figure 3 shows the recovered coefficients from Lasso versus \(\varDelta t\) and \(\varDelta x\) for the Burgers’ equation (15) and Burgers’ equation with diffusion (16). In Fig. 3a, data are sampled from the analytic solution of the Burgers’ equation (15) with spacing \(\varDelta x = 2^{k}\) for \(k=-12, -11, \ldots , -5\) respectively and \(\varDelta t = \varDelta x\) for \(t \in [0,0.05]\). Figure 3a shows the result from Lasso, namely, \(\{ \Vert \widehat{F}[j]\Vert _{\infty }^{-1} \widehat{\mathbf {a}}_{\text {Lasso}}(\lambda )_j\}\), versus \(\log _2 \varDelta x\). Notice that the coefficient of \(u u_x\) converges to \(-1\) and all other coefficients converge to 0 as \(\varDelta t\) and \(\varDelta x\) decrease. For Fig. 3b, data are sampled from the numerical simulation of the Burgers’ equation with diffusion in (16), where the PDE is solved by a first-order method with \(\delta x = 2^{-10}\) and \(\delta t = (\delta x)^2\) for \(t \in [0,0.1]\). Data are sampled with \(\varDelta x = 2^{-10},2^{-9},2^{-8},2^{-7},2^{-6}\) respectively, and \(\varDelta t = (\varDelta x)^2\) for \(t \in [0,0.1]\). Figure 3b shows the recovered coefficients from Lasso versus \(\log _2 \varDelta x\). Here the coefficients of \(u u_x\) and \(u_{xx} \) converge to \(-1\) and 0.1 respectively, and all other coefficients except the one of u, converge to 0, as \(\varDelta t\) and \(\varDelta x\) decrease. The coefficient of u does not converge to 0 because data are generated by a first order numerical scheme with the error \(O[\delta t + (\delta x)^2]\), i.e., the error for Lasso \(\Vert \mathbf {e}\Vert _{L^2}\) does not decay to 0 as \(\varDelta t\) and \(\varDelta x\) decrease. We further discuss this aspect of data generations in Sect. 3.

3 Noisy Data, Downsampling and IDENT

As noted above, identification results depend on the accuracy of the given data. In this section, we explore the effects of inaccuracy in data generations, noise and downsampling. We derive an error formula that incorporates the errors arising from these three aspects, which provides theoretical guidance on the difficulty of the identification.

The given data \(\{{\widetilde{u}}_i^n\}\) may contain noise, such that

$$\begin{aligned} {\widetilde{u}}_i^n = u_i^n + \xi _i^n, \end{aligned}$$

where the noise \(\xi _i^n\) arises from inaccuracy in data generations and/or the measurement error. Consider an rth-order PDE with the highest-order spatial derivative \(\partial _x^r u\). Suppose data are numerically simulated by a qth-order method with time step \(\delta t\) and spatial spacing \(\delta x\), and the measurement error is independently drawn from the normal distribution with mean 0 and variance \(\sigma ^2\). Then

$$\begin{aligned} \xi _i^n = O(\delta t + \delta x^q + \sigma ). \end{aligned}$$

We use the five-point ENO method to approximate the spatial derivatives in the empirical feature matrix \(\widehat{F}\) in Sect. 2. In general, one could interpolate the data with a pth-order polynomial and use the derivatives of the polynomial to approximate \(u_x\) and \(u_{xx}\), etc. In this case, the error for the kth-order spatial derivative \(\partial _x^k u\) is \(O(\varDelta x^{p+1-k})\).

The error for Lasso given by (7) is \(\mathbf {e} = \mathbf {b}-\widehat{\mathbf {b}} + (\widehat{F}-F)\mathbf {a}+\varvec{\eta }\), where \(\mathbf {b}-\widehat{\mathbf {b}}\) is from the approximation of \(u_t\), \((\widehat{F}-F)\mathbf {a}\) is from the approximation of the spatial derivatives of u, and \(\varvec{\eta }\) arises from the finite element basis expansion of the varying coefficients. If \(u,u_x,u_{xx},\ldots \) are bounded, these terms satisfy

$$\begin{aligned} \Vert (\widehat{F}-F)\mathbf {a}\Vert _\infty&\le O\left( \varDelta x^{p+1-r} + \frac{\delta t + \delta x^q +\sigma }{\varDelta x^r}\right) \text { and }\\ \Vert \mathbf {b}-\widehat{\mathbf {b}}\Vert _\infty&\le O\left( \varDelta t + \frac{\delta t + \delta x^q +\sigma }{\varDelta t}\right) , \end{aligned}$$

and \(\Vert \varvec{\eta }\Vert _\infty = O\left( 1/ L \right) \) so that

$$\begin{aligned} \Vert \mathbf {e}\Vert _{L^2} \le \varepsilon , \text { with } \varepsilon = O\left( \varDelta t + \varDelta x^{p+1-r}+ \underbrace{\frac{\delta t + \delta x^q}{\varDelta t}+ \frac{\delta t + \delta x^q }{\varDelta x^r}}_{\text {errors from data generations}} + \underbrace{\frac{\sigma }{\varDelta t} + \frac{\sigma }{\varDelta x^r}}_{\text {measurement noise}} + \frac{1}{L}\right) . \end{aligned}$$
(17)

This error formula suggests the following:

Sensitivity to measurement noise::

Finite differences are sensitive to measurement noise since Gaussian noise with mean 0 and variance \(\sigma ^2\) results in \(O(\sigma /\varDelta t + \sigma /\varDelta x^r)\) in the error formula. Higher-order PDEs are more sensitive to measurement noise than lower-order PDEs. Denoising the given data is helpful to Lasso in general.

Downsampling of data::

In applications, the given data are downsampled such that \(\varDelta t = C_t \delta t\) and \(\varDelta x = C_x \delta x\) where \(C_t\) and \(C_x\) are the downsampling factors in time and space. Downsampling could help to reduce the error depending on the balance among the terms in (17).

We further explore these effects below. We propose an order preserving denoising method in Sect. 3.1, test IDENT on noisy data in Sect. 3.2, and discuss the downsampling of data in Sect. 3.3.

3.1 An Order Preserving Denoising Method: Least-Squares Moving Average

Our error formula in (17) shows that a small amount of noise can quickly increase the complexity of the recovery, especially for higher-order PDEs. Denoising is helpful in general. We propose an order preserving method which keeps the order of the approximation to the underlying function, while smoothing out possible noise.

Let the data \(\{d_i\}\) be given on a one-dimensional uniform grid \(\{x_i\}\) and define its five-point moving average as \(\tilde{d_i} = \frac{1}{5}\sum _{l=0,\pm 1, \pm 2} d_{i+l}\) for all i. At each grid point \(x_i\), we determine a quadratic polynomial \(p(x)= a_0 + a_1 (x-x_i) + a_2 (x-x_i)^2\) fitting the local data, which preserves the order of smoothness, up to the degree of the polynomial. There are a few possible choices for denoising, such as (i) Least-Squares Fitting (LS): find \(a_0, a_1\) and \(a_2\) to minimize the functional \( F( a_0, a_1, a_2 )=\sum _{\mathrm{some} \; j\; \mathrm{near} \; i} (p(x_j)- d_j)^2\); (ii) Moving-Average Fitting (MA): find \(a_0, a_1\) and \(a_2\), such that the local average of the fitted polynomial matches the local average of the data, \( {1}/{5} \sum _{l=0,\pm 1, \pm 2} p(x_{j+l}) = \tilde{d_j},\) for \(j=i, i\pm 1\) (or another set of 3 grid points near \(\{x_i\}\)). The polynomial generated by LS may not represent the underlying true dynamics. Moving-average fitting is better at keeping the underlying dynamics; however, the matrix may be ill-conditioned when a linear system is solved to determine \(a_0,a_1,a_2\).

We propose to use (iii) Least-Squares Moving Average (LSMA): find \(a_0, a_1\) and \(a_2\) to minimize the functional

$$\begin{aligned}G( a_0, a_1, a_2 )= \sum _{j=i,i\pm 1, i\pm 2} \left\{ \left[ \frac{1}{5} \sum _{l=0,\pm 1, \pm 2} p(x_{j+l})\right] - \tilde{d_j}\right\} ^2.\end{aligned}$$

The condition number of this linear system tends to be better in comparison with MA, because j is chosen from a larger set of indices. This LSMA denoising method preserves the approximation order of the data and can easily be incorporated into numerical PDE techniques. MA fitting and LSMA are similar to the non-oscillatory polynomial reconstruction from cell averages which is a key step in high-resolution shock capturing schemes, see e.g. [1, 11, 12]. The quadratic polynomials computed by the methods above are locally third-order approximations to the underlying function. We prove that, if the given data are sampled from a third-order approximation to a smooth function, then LSMA will keep the same order of approximation. This theorem can easily be generalized to any higher order; we keep third order to be consistent with our experiments in this paper.
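A minimal sketch of LSMA along one spatial line is given below, under the assumption of a uniform grid; boundary points, where the five-point average is not available, are simply left unchanged here, and the function name is ours.

```python
import numpy as np

def lsma_denoise(d, dx):
    """Least-Squares Moving Average denoising of 1D samples d on a uniform grid.

    At each interior grid point x_i, fit p(x) = a0 + a1 (x - x_i) + a2 (x - x_i)^2
    so that the five-point moving averages of p match the moving-averaged data
    at j = i, i+-1, i+-2 in the least-squares sense; return p(x_i) = a0.
    """
    n = len(d)
    d_avg = np.convolve(d, np.ones(5) / 5.0, mode="same")   # five-point moving average
    u = np.array(d, dtype=float)
    for i in range(4, n - 4):
        rows, rhs = [], []
        for j in range(i - 2, i + 3):
            offs = (np.arange(j - 2, j + 3) - i) * dx        # x_{j+l} - x_i, l = 0, +-1, +-2
            rows.append([1.0, offs.mean(), (offs**2).mean()])
            rhs.append(d_avg[j])
        a, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
        u[i] = a[0]                                          # denoised value p(x_i)
    return u
```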

Theorem 2

If data are given as a 3rd order approximation to a smooth function, with or without additive noise, then denoising the data (to obtain a piecewise quadratic function) with the Least-Squares Moving Average (LSMA) method will keep the same order of accuracy to the function.

Proof

Let f(x) be the smooth function. The proof can be done by comparing the quadratic function to the Taylor expansion of f(x) at a grid point \(x_{i_0}\), see e.g., [15]. Let \(p(x)=a_0+a_1(x-x_{i_0})+a_2(x-x_{i_0})^2\) be the quadratic function to be determined near \(x_{i_0}\). The least-squares method solves the linear system \(A^T(Ac-b)=0\) for the coefficient vector \(c=[a_0, a_1,a_2]^T \), where A is a \(5\times 3\) matrix whose rows can be written as \([1, \frac{1}{5} \sum _{j=0,\pm 1, \pm 2}(x_{i+j}-x_{i_0}), \frac{1}{5} \sum _{j=0,\pm 1, \pm 2}(x_{i+j}-x_{i_0})^2], \) for \(i=i_0-2, \ldots , i_0+2\), and \(b=[{\tilde{d}}_{i_0-2},\ldots , {\tilde{d}}_{i_0+2}]^T.\)

According to the assumption we have

$$\begin{aligned} \tilde{d_i}= & {} f(x_{i_0})+f'(x_{i_0})\frac{1}{5} \sum _{j=0,\pm 1, \pm 2}(x_{i+j}-x_{i_0})\\&\quad + \frac{1}{2} f''(x_{i_0})\frac{1}{5} \sum _{j=0,\pm 1, \pm 2}(x_{i+j}-x_{i_0})^2+O(\varDelta x^3), \end{aligned}$$

for any grid point \(x_i\) near \(x_{i_0}\), i.e., \(|x_i-x_{i_0}|= O(\varDelta x)\). Let \(s=[f(x_{i_0}), f'(x_{i_0}), \frac{1}{2} f''(x_{i_0})]^T\). We have

$$\begin{aligned} A^T(Ac-b)= HB^T\{BH(c-s)+O(\varDelta x^3)\}, \end{aligned}$$

where H is the \(3\times 3\) diagonal matrix \(\mathrm{diag}\{1, \varDelta x, \varDelta x^2\} \), and

$$\begin{aligned} B=A H^{-1}=\left[ \begin{array}{lcr} &{}\vdots &{} \\ 1 &{} \frac{1}{5} \sum _{j=0,\pm 1, \pm 2}\frac{x_{i_0+j}-x_{i_0}}{\varDelta x} &{} \frac{1}{5} \sum _{j=0,\pm 1, \pm 2}\frac{(x_{i_0+j}-x_{i_0})^2}{\varDelta x^2} \\ &{}\vdots &{} \end{array} \right] . \end{aligned}$$

Therefore \(H(c-s)= (B^TB)^{-1}B^T\cdot O(\varDelta x^3)\). Since B is independent of \(\varDelta x\), we have \(|p(x)-f(x)|= O(\varDelta x^3)\) for all x such that \(|x-x_{i_0}|= O(\varDelta x)\). \(\square \)

3.2 IDENT Experiments for Noisy Data

We next present numerical experiments with noisy data. We say P percent Gaussian noise is added to the noise-free data \(\{u_i^n : i=1,\ldots ,N_1 \text { and } n=1,\ldots ,N_2\}\), if the observed data are \(\{{\widetilde{u}}_i^n\}\) where \({\widetilde{u}}_i^n = u_i^n + \xi _i^n\) and \(\xi _i^n \sim \mathcal {N}(0,\sigma ^2)\) with \(\sigma = \frac{P}{100} \sqrt{\sum _{i=1}^{N_1} \sum _{n=1}^{N_2} |u_i^n|^2}/\sqrt{N_1 N_2}\).
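For completeness, the noise model above can be realized as follows; this is a small helper of our own using numpy's random generator.

```python
import numpy as np

def add_noise(U, P, seed=0):
    """Add P percent Gaussian noise to the noise-free data U, following the definition above."""
    rng = np.random.default_rng(seed)
    sigma = (P / 100.0) * np.sqrt(np.sum(U**2)) / np.sqrt(U.size)
    return U + rng.normal(0.0, sigma, size=U.shape)
```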

Fig. 4

Burgers’ equation in (15) with 8% Gaussian noise. (a) Given noisy data, (b) Coherence pattern of the feature matrix. (c) The normalized coefficient magnitudes from Lasso. This fails to identify the correct term \(uu_x\)

Our first experiment is on the Burgers’ equation (15) with 8% Gaussian noise. Data are sampled from the analytic solution with \(\varDelta x = 1/56\) and \(\varDelta t = 0.004\) for \(t \in [0,0.05]\), and then 8% Gaussian noise is added. For comparison, we do not denoise the given data, but directly apply IDENT. Figure 4a shows the noisy given data, (b) shows the coherence pattern, and (c) shows the normalized coefficient magnitudes from Lasso. The NSR for Lasso defined in (14) is 3.04. Lasso fails to include the correct set of terms; thus TEE identifies the wrong equation \(u_t=-0.59u^2\) as a solution.

[TEE comparison table]

For the same data, Fig. 5 and the table below show the results when LSMA denoising is applied. After denoising, the given data are noticeably smoother; compare Fig. 5a with Fig. 4a. The Lasso result shows great improvement in Fig. 5c. With the correct terms included by Lasso in Step 2, TEE determines the PDE with the correct feature: \(u_t = -0.92 u u_x\).

[TEE comparison table]

Fig. 5

Burgers’ equation in (15) with 8% Gaussian noise as in Fig. 4. a The data after the LSMA denoising. b Coherence pattern of \(\widehat{F}\). c The normalized coefficient magnitudes from Lasso identifies 1, u and \(uu_x\) which include the correct term \(u u_x\)

In the next set of experiments, we explore different levels of noise for denoising+IDENT. In Fig. 6, we experiment on the Burgers’ equation (15) with its analytic solution sampled in the same way as above, while the noise level increases from 0 to 30%. For each noise level, we (i) first generate data with 100 sets of random noise, (ii) denoise by LS and LSMA respectively for comparison, then (iii) run IDENT. The parameter \(\tau \) is chosen as 10% of the largest coefficient magnitude. Figure 6a represents how likely wrong results are found. It is computed by the average ratio between the wrong coefficients and all computed coefficients: \(\textstyle \sum _{j \in \widehat{\varLambda }\setminus \varLambda } |\widehat{\mathbf {a}}_j|/\Vert \widehat{\mathbf {a}}\Vert _{1}\), where \(\varLambda \) and \(\widehat{\varLambda }\) are the exact support and the identified support, respectively. Each error bar represents the standard deviation of the results among 100 trials. The green curves, denoised by LSMA, show the most stable results even as the noise level increases. Figure 6b shows the recovered coefficient of \(uu_x\), where the true value is \(-1\). Notice that the LSMA+IDENT (green points) results are closer to \(-1\), while the others find wrong coefficients more often. In general, denoising the given data with LSMA improves the result significantly.

Fig. 6

Burgers’ equation (15) with increasing noise levels. a The average ratio between the identified wrong coefficients and all identified coefficients over 100 trials. b The recovered coefficient of \(u u_x\) by IDENT. Denoising the given data with LSMA significantly improves the result. The table shows the new NSR (14) corresponding to the noise level given in percentage

Fig. 7

Burgers’ equation with diffusion in (16) with varying noise levels. a The average ratio between the identified wrong coefficients and all identified coefficients over 100 trials. b, c The computed coefficients of \(u u_x\) and \(u_{xx}\), respectively, by IDENT. While the noise level in percentage seems small, the new NSR represents the severity of the noise for PDEs with high-order derivatives

Figure 7 shows the Burgers’ equation with diffusion in (16) with varying noise levels. The given data are sampled in the same way as in Fig. 2, and the noise level increases from 0 to 0.12%. (a) shows the average ratio between the wrong coefficients and the total coefficients. (b) and (c) show the recovered coefficients of \(uu_x\) and \(u_{xx}\), respectively. Again, using LSMA shows better performance.

For both Figs. 6 and 7, we present the new NSR defined in (14). This clearly shows that noise affects different PDEs in different ways. The Burgers’ equation (15) only has first-order derivatives, while the Burgers’ equation with diffusion in (16) has a second-order derivative. This seemingly small difference makes a big impact on the NSR and the identification. While in Fig. 6 the noise level is increased up to 30%, the corresponding new NSR varies only from 0 to less than 3.5. In Fig. 7, the noise level only varies from 0 to 0.12 in percentage; however, this corresponds to the new NSR varying from 0 to above 12. The level of the new NSR characterizes the difficulty of identification using IDENT (Step 2, Lasso), since having a higher-order term affects the Lasso negatively, especially in the presence of noise.

3.3 Downsampling Effects and IDENT

In applications, data are often collected on a coarse grid to save the expense of sensors. We explore the effect of downsampling in data collection in this section. Consider an rth-order PDE. Simulating its solution with a qth-order method on a fine grid with time step \(\delta t \) and spatial spacing \(\delta x\) gives rise to the error \(O(\delta t + \delta x^q)\). Suppose data are downsampled by a factor of \(C_t\) in time and \(C_x\) in space, such that data are sampled with spacing \(\varDelta t = C_t \delta t\) and \(\varDelta x = C_x \delta x\). Our error formula in (17) depends crucially on the downsampling factors \(C_t\) and \(C_x\). Each term is affected by downsampling differently.

  • The term \(\varDelta t + \varDelta x^{p+1-r}\) arises from the approximation of time and spatial derivatives. It increases as the downsampling factors \(C_t\) and \(C_x\) increase.

  • The term \(\frac{\delta t + \delta x^q}{\varDelta t} + \frac{\delta t + \delta x^q}{\varDelta x^r}\) arises from the error in data generations. It decreases as the downsampling factors \(C_t\) and \(C_x\) increase.

  • The term \(\frac{\sigma }{\varDelta t} + \frac{\sigma }{\varDelta x^r}\) arises from the measurement noise. It decreases as the downsampling factors \(C_t\) and \(C_x\) increase.

Therefore, downsampling may positively affect the identification depending on the balance among these three terms.

Fig. 8

Burgers’ equation in (15) with various downsampling factors. a, b show the average ratio between the identified wrong coefficients and all identified coefficients over 100 trials versus \(\log _2(\text {downsampling factor})\) in the presence of 5% (left) and 10% (right) noise, respectively. Increasing the downsampling factors can positively affect the result until the downsampling factors become too large

As a numerical example, we consider the Burgers’ equation in (15) with different downsampling factors. The analytic solution is evaluated on the grid with spacing \(\delta x = 1/1024\) and \(\delta t = 0.001\) for \(t \in [0,0.05]\). After evaluating the analytic solution, we generate 100 sets of random noises then downsample the noisy data with spacing \(\varDelta x = C_x \delta x\) and \(\varDelta t = C_t \delta t\) where \(C_x = C_t = 1,2,2^2,2^3,2^4\), and \(2^5\) respectively. We run IDENT on the given downsampled noisy data, denoised by LS and LSMA respectively. Figure 8 displays the ratio of wrong coefficients by IDENT versus \(\log _2 C_x\) in the presence of 5% or 10% Gaussian noise. We observe that increasing downsampling rates can positively affect the result until the downsampling rates become too large. LSMA also gives the best performance.

4 Varying Coefficients and Base Element Expansion

In this section, we consider PDEs with varying coefficients, i.e., coefficients \(a_j(x)\) varying in space. As illustrated in (4), we can easily generalize the IDENT set-up to PDEs with varying coefficients by expanding the coefficients in terms of finite element bases and solving group Lasso with \(L>1\). Due to the increasing number of coefficients, the complexity of the problem increases as L increases. In order to design a stable algorithm, we propose to let L grow before TEE is applied.

We refer to this extra procedure as Base Element Expansion (BEE). From the given discrete data \( \{u_{i}^n | i=1, \ldots , N_1 \text { and } n= 1, \ldots , N_2\}\), we first compute numerical approximations of \(u_t,u_x,u_{xx}\), etc., then apply BEE to gradually increase L until the recovered coefficients become stable. For each fixed L, we form the feature matrix \(\widehat{F}\) according to (6), and solve group Lasso with the balancing parameter \(\lambda \) to obtain \(\widehat{\mathbf {a}}_{\text {G-Lasso}}(\lambda )\). We record the normalized block magnitudes from group Lasso as L increases:

$$\begin{aligned} \text { BEE procedure}:= \left\{ \Vert \widehat{F}[j]\Vert _{L^1}\left\| \sum _{l=1}^L \frac{\widehat{\mathbf {a}}_{\text {G-Lasso}}(\lambda )_{j,l}}{\Vert \widehat{F}[j,l]\Vert _\infty } \phi _l \right\| _{L^1}\right\} _{j=1,\ldots ,N_3} \text { versus } L. \end{aligned}$$

The main idea of BEE is based on the convergence of the finite element approximation (5): as more basis functions are used, the approximation becomes more accurate. In the BEE procedure, the normalized block magnitudes reach a plateau as L increases, i.e., candidate features can be selected by thresholding according to (10) when L is sufficiently large. With this added BEE procedure, IDENT continues to Step 3, TEE, to refine the selection.
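A minimal sketch of the BEE loop is given below; it reuses the hat_basis and group_lasso_ista helpers sketched in Sects. 2.2 and 2.3, and, for simplicity, records plain block \(\ell ^2\) magnitudes as a proxy for the \(L^1\)-weighted magnitudes in the display above. The function and argument names are our own.

```python
import numpy as np

def bee_block_magnitudes(feature_cols, x_of_rows, b_hat, lam, L_max):
    """Record block magnitudes of the group-Lasso solution as L = 1, ..., L_max grows.

    feature_cols: N1*N2-by-N3 array of the monomial features of (1);
    x_of_rows: spatial coordinate x_i of each row of feature_cols.
    Reuses hat_basis (Sect. 2.2) and group_lasso_ista (Sect. 2.3).
    """
    N3 = feature_cols.shape[1]
    history = []
    for L in range(1, L_max + 1):
        nodes = np.linspace(x_of_rows.min(), x_of_rows.max(), L)
        phi = hat_basis(x_of_rows, nodes)
        # block feature matrix (6): feature j times basis l, in blocks of size L
        F = np.column_stack([feature_cols[:, j] * phi[:, l]
                             for j in range(N3) for l in range(L)])
        F_inf = F / np.max(np.abs(F), axis=0)
        groups = [np.arange(j * L, (j + 1) * L) for j in range(N3)]
        z = group_lasso_ista(F_inf, b_hat, groups, lam)
        history.append([np.linalg.norm(z[g]) for g in groups])   # one magnitude per feature
    return np.array(history)        # rows indexed by L, as plotted in Fig. 9b
```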

In the following, we present various numerical experiments for PDEs with varying coefficients using IDENT with BEE. For the first set of experiments, in Figs. 9, 10 and 11, we assume only one coefficient is known a priori to vary in x. For the second set of experiments, in Fig. 12, we assume two coefficients are known a priori to vary in x, and the final experiment, in Fig. 13, assumes all coefficients are free to vary without any a priori information.

Fig. 9

Burgers’ equation with a varying diffusion coefficient (18) where data are downsampled by a factor 4. a The given data. b BEE as L increases from 1 to 30. c An example of the magnitudes of coefficients from Group Lasso when \(L=20\). d TEE versus L, for all subsets of coefficients of \(\{uu_x \ u_{xx} \}\). e Recovered diffusion coefficient \(\widehat{c}(x) = \sum _{l=1}^L \widehat{a}_{7,l}\phi _l(x)\) when \(L=20\) (blue), compared with the true diffusion coefficient c(x) (red). f The error \(\Vert c(x)-\widehat{c}(x)\Vert _{L^1}\) as L increases from 1 to 30

The first experiment is on the Burgers’ equation with a varying diffusion coefficient:

$$\begin{aligned}&u_t + \left( \frac{u^2}{2}\right) _x =c(x)u_{xx}, \text { where } c(x) = 0.05+0.2 \sin \pi x \nonumber \\&\ x \in [0,1], \ u(x,0)= \sin (4\pi x)+0.5\sin (8 \pi x) \text { and } u(0,t) = u(1,t) = 0. \end{aligned}$$
(18)

The given data, shown in Fig. 9a, are numerically simulated by a first-order method with spacing \(\delta x = 1/256\) and \(\delta t = (\delta x)^2/2\) for \(t \in [0,0.05]\). Data are downsampled by a factor of 4 in time and space. There is no measurement noise. Our objective is to identify the correct features \(u u_x\) and \(u_{xx}\), and to recover the constant coefficient \(-1\) and the varying coefficient c(x). After we expand the diffusion coefficient with L finite element bases, the vector to be identified can be written as \(\mathbf {a}= [a_1 \ldots a_6 \ a_{7,1} \ \ldots \ a_{7,L} \ a_8 \ldots a_{10}]^T\) where \(c(x) \approx \sum _{l=1}^L a_{7,l}\phi _l(x)\).

Figure 9b presents BEE as L increases from 1 to 30. This graph clearly shows that BEE stabilizes when \(L \ge 5\). (c) is an example of the group Lasso result, the normalized block magnitudes, when \(L = 20\). The magnitudes of \(uu_x\) and \(u_{xx}\) are significantly larger than the others, so they are picked for TEE in Step 3. Figure 9d presents TEE for all different L: TEE of \(uu_x\), \(u_{xx}\) and \([uu_x \ u_{xx}]\) in \(\log _{10}\) scale, respectively. The correct set, \([uu_x \ u_{xx}]\), is identified as the recovered feature set with the smallest TEE. The coefficient \([\widehat{a}_6 \ \widehat{a}_{7,1} \ldots \widehat{a}_{7,L}]\) is computed by least squares, and (e) displays the recovered diffusion coefficient \(\widehat{c}(x) = \sum _{l=1}^L \widehat{a}_{7,l}\phi _l(x)\) when \(L=20\), compared with the true coefficient \(c(x) = 0.05+0.2 \sin \pi x\) given in (18). Figure 9f shows the error \(\Vert c(x)-\widehat{c}(x)\Vert _{L^1}\) as L increases from 1 to 30. The error decreases as L increases, yet does not converge to 0 due to the errors arising from data generations and the finite-difference approximations of \(u_t,u_x\) and \(u_{xx}\).

For the same equation, \(0.2\%\) noise is added in the next experiments, presented in Figs. 10 and 11. Figure 10 presents the result without any denoising. (a) shows BEE as L increases from 1 to 30, where the magnitudes of \(u,uu_x,u_{xx},u_xu_{xx}\) are not negligible for \(L \ge 20\). These terms are picked for TEE, and Fig. 10b shows TEE versus L. The correct support \([uu_x \ u_{xx}]\) is identified with the lowest TEE when \(L \ge 7\). The computed diffusion coefficient \(\widehat{c}(x)\) is compared to the true one in (c), with error \(\Vert c(x)-\widehat{c}(x)\Vert _{L^1} \approx 0.019\). Even for noisy data, IDENT+BEE without any denoising gives a good identification of the general form of the PDE. However, the approximation of the varying coefficient can be improved if LSMA denoising is applied to the data, as discussed in Sect. 3.
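The TEE selection used throughout Figs. 9, 10 and 11 can be illustrated by the following sketch, which evolves a candidate right-hand side forward from the first data snapshot and accumulates the discrepancy with the given data; the forward-Euler integrator, the time refinement and the error norm are assumptions made for illustration and may differ from the exact TEE definition in the paper.

```python
import numpy as np

def time_evolution_error(data, dx, dt, rhs, substeps=10):
    """Hypothetical TEE sketch: starting from the first data snapshot, evolve the
    candidate PDE u_t = rhs(u, x, dx) with forward Euler (refined in time by
    `substeps`) and accumulate an L^1-type discrepancy at the recorded times."""
    n_t, n_x = data.shape
    x = np.linspace(0.0, 1.0, n_x)
    u = data[0].copy()
    err = 0.0
    for n in range(1, n_t):
        for _ in range(substeps):
            u = u + (dt / substeps) * rhs(u, x, dx)
            u[0] = u[-1] = 0.0              # keep the data's Dirichlet boundary values
        err += np.sum(np.abs(u - data[n])) * dx * dt
    return err

# Example candidate right-hand side for the support [u u_x, u_xx] with a recovered
# diffusion coefficient evaluated on the grid (illustrative names):
# rhs = lambda u, x, dx: (-a6_hat * u * np.gradient(u, dx)
#                         + c_hat_vals * np.gradient(np.gradient(u, dx), dx))
```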

Figure 11 presents the same experiment with LSMA denoising. In (a), BEE picks \(u,uu_x,u_{xx}\) for TEE. Notice that the coefficient of \(u_x u_{xx}\) almost vanishes after denoising compared to Fig. 10. (b) shows TEE versus L, where the correct support \([uu_x \ u_{xx}]\) gives the lowest TEE, when \(L \ge 19\). The recovered diffusion coefficient when \(L=20\) is shown in (c), which yields the error \(\Vert c(x)-\widehat{c}(x)\Vert _{L^1} \approx 0.008\). In comparison with the results in Fig. 10 without denoising, LSMA reduces the error of the recovered diffusion coefficient from 0.019 to 0.008.
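A rough stand-in for the LSMA denoising step is sketched below: a sliding-window least-squares polynomial fit applied along each slice of the data. The local polynomial order and window width used by the actual LSMA method of Sect. 3.1 may differ; this sketch is only meant to convey the idea of a least-squares moving average.

```python
import numpy as np

def moving_least_squares_smooth(u, half_width=5, degree=2):
    """Sliding-window least-squares polynomial smoothing of one data slice:
    at each grid point, fit a degree-`degree` polynomial to the noisy values
    in the window by least squares and keep the fitted value at the center."""
    n, smoothed = len(u), np.empty(len(u))
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        offsets = np.arange(lo, hi) - i
        coeffs = np.polyfit(offsets, u[lo:hi], degree)   # local least-squares fit
        smoothed[i] = np.polyval(coeffs, 0.0)            # fitted value at the center
    return smoothed

# Applied row by row (in x) to the noisy data before forming the feature matrix:
# denoised = np.apply_along_axis(moving_least_squares_smooth, 1, noisy_data)
```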

Fig. 10

Equation (18), where data are downsampled by a factor of 4 and \(0.2\%\) measurement noise is added. No denoising is applied. a BEE as L increases from 1 to 30. b TEE versus L, for all subsets of the four terms selected in (a). The correct support \([uu_x \ u_{xx}]\) is identified with the lowest TEE when \(L \ge 7\). c Recovered diffusion coefficient \(\widehat{c}(x)\) when \(L=20\) (blue), compared with the true diffusion coefficient c(x) (red)

Fig. 11

The same experiment as Fig. 10, but IDENT+BEE is applied with LSMA denoising. a BEE as L increases from 1 to 30. b TEE versus L, for all subsets of the terms identified in (a). The correct support \([uu_x \ u_{xx}]\) is identified since it gives rise to the lowest TEE when \(L \ge 19\). c The recovered diffusion coefficient when \(L=20\), compared with the true diffusion coefficient (red), which shows a clear improvement compared to Fig. 10c

In Fig. 12, we experiment on the following PDE with two varying coefficients:

$$\begin{aligned}&u_t =b(x)u_x + c(x)u_{xx}, \text { where } b(x)=-2x \text { and } c(x) = 0.05+0.2 \sin \pi x, \nonumber \\&x \in [0,1], \ u(x,0)= \sin (4\pi x)+0.5\sin (8 \pi x) \text { and } u(0,t) = u(1,t) = 0. \end{aligned}$$
(19)

The given data are simulated by a first-order method with spacing \(\delta x = 1/256\) and \(\delta t = (\delta x)^2/2\) for \(t \in [0,0.05]\). The vector to be identified is \(\mathbf {a}= [a_1 \ldots a_3 \ a_{4,1} \ \ldots \ a_{4,L} \ a_5 \ a_6 \ a_{7,1} \ \ldots \ a_{7,L} \ a_8 \ldots a_{10}]^T\) where \(b(x) \approx \sum _{l=1}^L a_{4,l}\phi _l(x)\) and \(c(x) \approx \sum _{l=1}^L a_{7,l}\phi _l(x)\). Figure 12a shows the numerical solution of (19). In (b) the given data are downsampled by a factor of 4 in time and space, and in (c) the data are not downsampled. BEE and TEE successfully identify the correct features. Figure 12b plots both the recovered coefficients \(\widehat{b}(x)\), \(\widehat{c}(x)\) and the true coefficients b(x) and c(x) when the data are downsampled. Notice that the coefficient \(\widehat{b}(x)\) of \(u_x\) is not accurate when x is close to 1. The result is improved in (c), where the data are not downsampled; avoiding downsampling keeps the details of the solution around \(x=1\) and reduces the finite-difference approximation errors.

Fig. 12

Equation (19) with varying convection and diffusion coefficients. a The numerical solution of (19). b With data downsampled by a factor of 4 in time and space, the recovered coefficient \(\widehat{b}(x)\) of \(u_x\) is not accurate near \(x=1\). The downsampling rate is too high near \(x=1\) so that details of the solution are lost. c The same experiment without any downsampling, and the recovered coefficients \(\widehat{b}(x)\) and \(\widehat{c}(x)\) are more accurate than (b)

Our final experiment is on Eq. (18), but all coefficients are allowed to vary in x. The numerical solution is simulated in the same way as in Fig. 9, and the given data are downsampled by a factor of 4 in time and space. After all coefficients are expanded in terms of L finite element bases, the vector to be identified is \(\mathbf {a}= \{a_{k,l}\}_{k=1,\ldots ,10,\ l=1,\ldots ,L}\) where \(-1 = b(x) \approx \sum _{l=1}^L a_{6,l}\phi _l(x)\) and \(c(x) \approx \sum _{l=1}^L a_{7,l}\phi _l(x)\). Figure 13a shows BEE, (b) shows the group Lasso result, and (c) shows TEE. TEE identifies the correct support \([uu_x \ u_{xx}]\) since it yields the smallest error. The coefficients \([\widehat{a}_{6,1} \ \ldots \ \widehat{a}_{6,L} \ \widehat{a}_{7,1} \ldots \widehat{a}_{7,L}]\) are computed by least squares. Figure 13d displays the computed coefficients \(\widehat{b}(x) = \sum _{l=1}^L \widehat{a}_{6,l}\phi _l(x)\) and \(\widehat{c}(x) = \sum _{l=1}^L \widehat{a}_{7,l}\phi _l(x)\) when \(L=20\), and (e) shows the coefficient recovery errors \(\Vert -1-\widehat{b}(x)\Vert _{L^1}\) and \(\Vert c(x)-\widehat{c}(x)\Vert _{L^1}\) as L increases from 1 to 30. IDENT with BEE successfully identifies the correct terms even when all coefficients are free to vary in space. The accuracy of the recovered coefficients could be further improved if the data were simulated and sampled on a finer grid.

Fig. 13

Equation (18), where all coefficients are a priori allowed to vary with respect to x. Data are downsampled by a factor of 4 in time and space

5 Concluding Remarks

We proposed a new method to identify PDEs from a given set of time dependent data, with numerical PDE techniques and the fundamental convergence principle of numerical PDE schemes. Assuming that the PDE contains few active terms in a prescribed dictionary, we use finite differences, such as the ENO scheme, to form an empirical approximation of the dictionary, and utilize \(L^1\) minimization to efficiently select a candidate set. Time Evolution Error (TEE) was proposed as an effective tool to pick the correct set of terms.

Starting from the first set of basic experiments, we considered noisy data, downsampling effects and PDEs with varying coefficients. A recovery theory of Lasso for PDE identification was established, where a new noise-to-signal ratio (14) was proposed to measure the noise level more accurately in the setting of PDE identification. We derived an error formula in (17) and analyzed the effects of noise and downsampling. A new order preserving denoising method called LSMA was proposed in Sect. 3.1 to aid the identification with noisy data. IDENT can be applied to PDEs with varying coefficients and a BEE procedure helps to stabilize the results and reduce the computational cost.

In this problem set-up, we consider a specific discretization in (1), but our proposed method is general: TEE rests on the principle of numerical convergence, and it works with any consistent and stable discretization as effectively as the current one. Denoising the data typically introduces a bias into the learning process; yet it also greatly reduces the variance and helps to identify the correct set of coefficients, so the final result is a balance between bias and variance. TEE further mitigates the bias from denoising, since it selects the PDE whose time evolution best matches the given data, and the bias introduced by denoising has no time dependency.

A recovery theory of Lasso for PDE recovery was established in Theorem 1 in terms of an incoherence condition. In the compressive sensing community, the matrices proved to satisfy this incoherence condition are mostly random matrices. In the PDE identification setting, the feature matrix arises from a PDE and is deterministic; its mutual coherence is usually close to 1, which violates the incoherence condition. Nevertheless, the coherence pattern provides valuable empirical guidance: when some columns of the feature matrix are highly correlated, the support recovered by Lasso includes the true support together with the indices highly correlated with it. Therefore, our proposed Time Evolution Error (TEE) plays a crucial role in ruling out the wrong terms and picking the correct support. Theorem 1 also gives insight into a proper definition of the noise-to-signal ratio for PDE identification.

Since our problem set-up only utilizes data from one realization of the PDE with one initial condition, it is possible that IDENT identifies a different PDE from the true one when the given data are not sufficient. IDENT finds the PDE that gives the smallest TEE; when only a small amount of data is given, this may be a simpler PDE which is not the true one, but which represents the given data set well. IDENT will not identify an ill-posed PDE, since evolving an ill-posed PDE makes the TEE blow up, so such a PDE will not be chosen.

The idea of selecting a candidate set by Lasso and then refining the result was also considered in [20], where the authors used the well-known statistical AIC test to find the best equation, i.e., the one minimizing a certain loss function. IDENT instead utilizes numerical time evolution, which focuses on the evolution of the PDE and better captures the time-dependent dynamics. In practice, we observed that TEE performs well for noisy data and/or nonlinear high-order PDEs.

Our proposed method can be compared to existing methods in the following way. First, it does not require a large training data set and only uses one realization of the PDE, whereas a large training set is required in the deep learning approaches [14, 17, 19, 21, 22, 23, 24]. Second, by using various numerical PDE schemes, (i) our method is flexible with respect to boundary conditions, instead of only considering periodic boundary conditions [26]; and (ii) it can handle noisy data, which is considered very challenging in this area. Third, this is the first work to address varying coefficients, which opens up many new PDEs to be identified.