1 Introduction

System identification is often the first and critical step in system analysis, design, simulation, and control. In the literature, there exists a huge number of papers as well as various well-developed algorithms for linear system identification [11, 18, 28]. Despite a long history and practical demands, nonlinear system identification is far from mature both in theory and in practice [15, 21, 27, 29]. Because the structure of nonlinear systems is so rich, no single method can be expected to apply effectively to all nonlinear systems. Instead, various identification methods have to be developed for different classes of nonlinear systems and for different intended purposes.

Roughly speaking, nonlinear system identification can be divided into two categories depending on the a priori information available on the structure of the unknown system. If the structure of the unknown system is available a priori, the identification problem is reduced to a parameter estimation problem, essentially a nonlinear minimization problem. The main issues are how to find a minimum and whether the obtained minimum is a global one. In the other category, no a priori information on the structure is available. This is a much harder problem. Traditional ways to approach it are the Volterra and Wiener series representations [25]. They are elegant in theory, but their applications are often limited. For the Volterra series, the application is limited to very low-order kernels because the number of unknown parameters to be estimated increases exponentially. Further, identification has to be repeated every time an additional kernel is deemed necessary and added. For the Wiener series, the input is usually assumed to be Gaussian. For both the Volterra series and the Wiener series, the basic idea is a multivariable polynomial approximation of the unknown system; thus, a very high-order model is needed to approximate the true but unknown nonlinear system. This makes them practically intractable unless the unknown system is close to a polynomial of low order. To overcome this problem, the fixed basis function approach developed for linear systems [23, 31] has been investigated and applied to nonlinear system identification with some success [10, 16, 30]. Typical basis functions are Fourier series, polynomials, and some orthogonal functions. In particular, orthogonal functions are very attractive because no previously obtained terms have to be reestimated when an additional term is added; only the added term needs to be estimated. Clearly, the success of the orthogonal basis function approach relies on the fact that a nonparametric nonlinear identification problem is reduced to a parametric estimation problem and, moreover, the estimation of each term is separable in some sense. On the other hand, its advantage is also its weakness. The performance of an orthogonal basis function approach, like any basis function approach, depends on whether the chosen basis functions resemble the structure of the unknown nonlinear system. Without enough a priori information on the structure, a fixed basis function approach often requires a large number of terms to approximate the true but unknown nonlinear system reasonably well, which has a considerable negative effect on the identification performance. Some remedies, e.g., tunable basis functions such as wavelets, neural networks, and fuzzy models, have been proposed in the literature [14, 33, 34]. Even with these tunable basis functions, adequate a priori information on the structure is still needed so that the tunable basis functions are rich enough to capture the unknown system. There is an additional difficulty with such tunable basis function approaches, namely that the minimization could be trapped in a local minimum.

In this work, we propose a data-driven basis function approach to nonlinear system identification. The basis functions are not fixed but are data generated as a part of identification. The basis functions are chosen as a result of identification and automatically match the structure of the unknown nonlinear system. This eliminates the problem of blindly guessing basis functions without knowing the structure of the unknown nonlinear system. Further, the chosen basis functions are orthogonal and when it is determined that an additional term is needed, all the previously calculated terms remain unchanged and only the added term has to be identified. This is particularly useful since the order and the structure of the nonlinear system are unknown and have to be determined as a part of identification.

The main contribution is a framework that uses data-driven orthogonal basis functions for nonparametric nonlinear system identification. The chosen orthogonal functions always match the system even though the system is unknown and very little a priori information on its structure is assumed. This is different from the existing literature where fixed basis functions are used for system identification. The work is motivated by [2, 26] though the driving force is completely different. In addition, approaches are proposed for model order determination and regressor selection. The first is the combined residual analysis and modified Box–Pierce hypothesis test approach. It is known in the literature that the popular Box–Pierce test extensively used in linear identification [18, 29] is in general invalid for nonlinear identification, and a modified Box–Pierce test is proposed in this work in the context of nonlinear system identification. The second approach is the relative and cumulative contribution approach. It utilizes the orthogonal properties of the basis functions and is simple and effective. To present the material without interruption, all the proofs are provided in the Appendix.

2 System and Orthogonal Basis Functions

Consider a general nonparametric nonlinear finite impulse response (FIR) system

$$\begin{aligned} y[k]&=f(u[k-1],u[k-2],...,u[k-n])+v[k] \\&=\bar{c}+\sum _{j=1}^n\bar{f}_j(u[k-j])+ \sum _{1\le j_1< j_2 \le n}\bar{f}_{j_1j_2}(u[k-j_1],u[k-j_2])+... \\&+\sum _{1\le j_1<j_2<...< j_m\le n}\bar{f}_{j_1j_2...j_m} (u[k-j_1],u[k-j_2],...,u[k-j_m]) \\&+ v[k],~k=1,2,...,N \end{aligned}$$

where y[k] and u[k] are output and input measurements. It is assumed that

  1. 1.

    The input u[k] is an independent and identically distributed (iid) random sequence in an (unknown) open interval \(I \subset R\) with an (unknown) probability density function \(\psi (\cdot )\). The noise v[k] is a sequence of iid random variables with zero mean and bounded variance.

  2. 2.

    The exact time lag is unknown and only the upper bound n is available.

  3. 3.

    The functions \(\bar{f}_{j_1j_2...j_l}\)’s, \(l \le n\), referred to as l-factor terms, are unknown and describe interactions of variables \(u[k-j_1],u[k-j_2],..., u[k-j_l]\). No structural prior information on \(\bar{f}_{j_1j_2...j_l}\)’s is assumed.

To convey the idea clearly without tedious and unilluminating detailed technical derivations, we will focus on the system with up to 2-factor interactive terms in this work.

$$\begin{aligned} \nonumber y[k]&=f(u[k-1],u[k-2],...,u[k-n])+v[k] \\&=\bar{c}+\sum _{j=1}^n\bar{f}_j(u[k-j])+ \sum _{1\le j_1 < j_2 \le n}\bar{f}_{j_1j_2}(u[k-j_1],u[k-j_2])+v[k] \end{aligned}$$
(1)

All the results of this work can be trivially but cumbersomely extended to a general system with arbitrary interactive terms. Obviously, for a system with up to 2-factor interactive terms, there are in total \(M+1=1+n+n(n-1)/2\) terms in the system: one constant term, n 1-factor terms \(\bar{f}_j\)’s, and \(\frac{n(n-1)}{2}\) 2-factor terms \(\bar{f}_{j_1j_2}\)’s.

The questions we are concerned with are:

  • How to determine orthogonal basis functions \(\phi _i(\cdot )\)’s, \(i=0,1,...,M\), based on the given data set \(\{y[k],u[k]\}_1^N\) that represents the unknown system (1)?

  • How to identify these basis functions?

  • Once the basis functions \(\phi _i(\cdot )\)’s are determined, it does not mean that all \(M+1\) terms are needed. In most practical cases, only the terms \(i=0,1,...,p\) with \(p < M+1\) are needed. How to find the order p?

  • Even if the order p is found, the system could be sparse in the sense that not all terms \(i=0,1,2,...,p\) are present and many terms are actually zero. How to identify those terms so that they can be removed?

In the following derivation, we denote the expectation operator by \(\mathbf{E}\) and conditional expectation operators for given \(u[k-j_1]=x_{j_1}\), and/or \(u[k-j_2]=x_{j_2}\) by, respectively,

$$ \mathbf{E}(y[k]~|~u[k-j_1]=x_{j_1}), $$
$$ \mathbf{E}(y[k]~|~u[k-j_1]=x_{j_1},u[k-j_2]=x_{j_2}),~ $$
$$ \mathbf{E}(f_{j_1j_2}(u[k-j_1],u[k-j_2])~|~u[k-j_1]=x_{j_1}),~ $$
$$ \mathbf{E}(f_{j_1j_2}(u[k-j_1],u[k-j_2])~|~u[k-j_2]=x_{j_2}). $$

For every \(x_{j_1}\) and \(x_{j_2} \in I\), define the normalized functions \(f_{j_1j_2}\)’s and \(f_j\)’s in (2).

$$\begin{aligned} f_{j_1j_2}(x_{j_1},x_{j_2})&= \bar{f}_{j_1j_2}(x_{j_1},x_{j_2}) -\mathbf{E}( \bar{f}_{j_1j_2}(u[k-j_1],u[k-j_2])~|~u[k-j_2]=x_{j_2}) \\&-\mathbf{E}( \bar{f}_{j_1j_2}(u[k-j_1],u[k-j_2])~|~u[k-j_1]=x_{j_1}) \\&+\underbrace{\mathbf{E}\{ \bar{f}_{j_1j_2}(u[k-j_1],u[k-j_2])\}}_{c_{j_1j_2}}, ~~1\le j_1<j_2\le n \\ f_1(x_1)&= \bar{f}_1(x_1) + \sum _{i=2}^n \mathbf{E}(\bar{f}_{1i}(u[k-1],u[k-i])~|~ u[k-1]=x_1) \\&-\underbrace{\mathbf{E}\{\bar{f}_1(u[k-1]) +\sum _{i=2}^n \mathbf{E}(\bar{f}_{1i}(u[k-1],u[k-i])~|~ u[k-1]=x_1)\} }_{c_1}\\ f_j(x_j)&= \bar{f}_j(x_j) +\sum _{i=j+1}^n \mathbf{E}(\bar{f}_{ji}(u[k-j],u[k-i])~|~ u[k-j]=x_j) \end{aligned}$$
$$\begin{aligned} \nonumber&+\sum _{i=1}^{j-1} \mathbf{E}(\bar{f}_{ij}(u[k-i],u[k-j])~|~u[k-j]=x_j) \\ \nonumber&-\underbrace{\mathbf{E}\{ \bar{f}_j(u[k-j])+\sum _{i=j+1}^n \mathbf{E}(\bar{f}_{ji}(u[k-j],u[k-i])~|~u[k-j]=x_j)}_ { c^1_j} \\ \nonumber&+\underbrace{ \sum _{i=1}^{j-1} \mathbf{E}(\bar{f}_{ij}(u[k-i],u[k-j])~|~u[k-j]=x_j)\} }_{c^2_j}, ~j=2,...,n-1\\ \nonumber f_n(x_n)&= \bar{f}_n(x_n) +\sum _{i=1}^{n-1} \mathbf{E}(\bar{f}_{in}(u[k-i],u[k-n])~| ~u[k-n]=x_n) \\ \nonumber&- \underbrace{\mathbf{E}\{\bar{f}_n(u[k-n])+\sum _{i=1}^{n-1} \mathbf{E}(\bar{f}_{in}(u[k-i],u[k-n])~|~u[k-n]=x_n) \} }_{c_n} \\ c&=\bar{c} -\sum _{1\le j_1 < j_2 \le n}c_{j_1j_2}+\sum _{j=1}^nc_j,~~with~~ c_j=c^1_j+c^2_j. \end{aligned}$$
(2)

Then, the system (1) can be rewritten as

$$\begin{aligned} y[k]&=c+\sum _{j=1}^nf_j(u[k-j])+ \sum _{1\le j_1<j_2 \le n} f_{j_1j_2}(u[k-j_1], u[k-j_2]) \nonumber \\&+v[k],~k=1,2,\ldots ,N \end{aligned}$$
(3)

We are now in a position to define data dependent orthogonal basis functions \(\phi _i\), \(i=0,...,M\).

$$\begin{aligned}&\phi _0=c \Longrightarrow \phi _0 ~~~~~\phi _j(x_j)=f_j(x_j),~j=1,...,n \Longrightarrow \phi _1,...,\phi _n,\\&\phi _{\frac{2n}{2}-1+j}(x_1,x_j)=f_{1j}(x_1,x_j),~j=2,...,n ~~~\Longrightarrow \phi _{n+1},...,\phi _{2n-1}, \\&\phi _{\frac{2n-1}{2}2-2+j}(x_2,x_j)=f_{2j}(x_2,x_j),~j=3,...,n ~~~\Longrightarrow \phi _{2n},...,\phi _{3n-3}, \\&\phi _{\frac{2n-2}{2}3-3+j}(x_3,x_j)=f_{3j}(x_3,x_j),~j=4,...,n ~~~\Longrightarrow \phi _{3n-2},...,\phi _{4n-6}, \\&\qquad \ldots \\&\phi _{\frac{2n-(n-3)}{2}(n-2)-(n-2)+j}(x_{n-2},x_j)= f_{(n-2)j}(x_{n-2},x_j), ~ \Longrightarrow \phi _{\frac{n^2+n}{2} -2},\phi _{\frac{n^2+n}{2} -1}, \\&\qquad \qquad \qquad \qquad \qquad \qquad \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! j=n-1,n \\&\phi _{\frac{2n-(n-2)}{2}(n-1)-(n-1)+j}(x_{n-1},x_j)=f_{(n-1)j}(x_{n-1},x_j), ~~~\Longrightarrow \phi _{\frac{n^2+n}{2}}, j=n \end{aligned}$$

When the meaning is clear from the context, we interchangeably use

$$\begin{aligned} \phi _j[k]&=\phi _j(u[k-j]), ~j=1,...,n \\\phi _j[k]&=\phi _j(u[k-1],u[k-j+n-1]), j=n+1,...,2n-1 \\\phi _j[k]&=\phi _j(u[k-2],u[k-j+2n-3]),j=2n,...,3n-3 \\&\ldots \\\phi _j[k]&=\phi _j(u[k-n+2],u[k-j+M-n-1]), j=M-2,M-1 \\\phi _j[k]&=\phi _j(u[k-n+1],u[k-n]), j=M=n(n+1)/2. \end{aligned}$$

Clearly, \(\phi _0\) denotes the constant term, \(\phi _j(x_j)\)’s, \(j=1,...,n\), represent the 1-factor terms and \(\phi _i(x_{j_1},x_{j_2})\)’s, \(i=n+1,...,M\), are 2-factor terms. The following theorem is the main result of this section.
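
To make the flat numbering of the \(\phi _i\)’s concrete, the following small helper (a bookkeeping illustration of ours, not part of the identification procedure; the function names are hypothetical) maps a lag pair \((j_1,j_2)\) to its flat index and back.

```python
def term_index(j1, j2, n):
    """Flat index of the 2-factor term for the lag pair (j1, j2), 1 <= j1 < j2 <= n.

    phi_0 is the constant, phi_1..phi_n are the 1-factor terms, and the 2-factor
    terms are numbered row by row: (1,2),...,(1,n),(2,3),...,(n-1,n).
    """
    assert 1 <= j1 < j2 <= n
    # 2-factor terms with first lag smaller than j1, plus the offset inside row j1
    return n + (j1 - 1) * n - j1 * (j1 - 1) // 2 + (j2 - j1)

def term_lags(i, n):
    """Inverse map: flat index i -> () for the constant, (j,) or (j1, j2)."""
    if i == 0:
        return ()
    if i <= n:
        return (i,)
    for j1 in range(1, n):
        row_start, row_end = term_index(j1, j1 + 1, n), term_index(j1, n, n)
        if row_start <= i <= row_end:
            return (j1, j1 + 1 + (i - row_start))
    raise ValueError("index out of range")

# Example with n = 4 (M = 10): term_index(1, 2, 4) == 5 and term_lags(10, 4) == (3, 4).
```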

Theorem 1

Consider the system (1). Then we have:

  1. 1.

    The system (1) can be represented by the data-driven basis functions \(\phi _i\)’s,

    $$\begin{aligned} y[k]=\sum _{i=0}^M \phi _i[k]+v[k] \end{aligned}$$
    (4)

    where \(M=n+n(n-1)/2=n(n+1)/2\).

  2. 2.

    The data-driven basis functions \(\phi _i\)’s are orthogonal, i.e., for all \(1\le j \le M\) and \(0\le j_1 < j_2 \le M\),

    $$ \mathbf{E}\phi _j[k]=0,~~\mathbf{E}\phi _{j_1}[k]\phi _{j_2}[k]=0. $$
  3. 3.

    The unknown \(\phi _j\)’s are the expectations or conditional expectations of the output,

    $$\begin{aligned} \nonumber \phi _0&=\mathbf{E}\{y[k]\}, \\ \nonumber \phi _j(x_j)&= \mathbf{E}\{y[k]~|~u[k-j]=x_j\}-\phi _0, ~j=1,...,n, \\ \nonumber \phi _{\frac{2n}{2}-1+j}(x_1,x_j)&=\mathbf{E}\{y[k]~|~u[k-1]=x_1,u[k-j]=x_j\} \\ \nonumber&-\phi _1(x_1)-\phi _j(x_j)-\phi _0,~j=2,...,n \\ \nonumber \phi _{\frac{2n-1}{2}2-2+j}(x_2,x_j)&=\mathbf{E}\{y[k]~|~u[k-2]=x_2,u[k-j]=x_j,\} \\ \nonumber&-\phi _2(x_2)-\phi _j(x_j)-\phi _0,~j=3,...,n \\ \nonumber&\ldots \\ \nonumber \phi _{\frac{2n-(n-3)}{2}(n-2)-(n-2)+j}(x_{n-2},x_j)&=\mathbf{E}\{y[k]~|~u[k-n+2]=x_{n-2},u[k-j]=x_j\} \\ \nonumber&-\phi _{n-2}(x_{n-2})-\phi _j(x_j)-\phi _0, ~j=n-1,n \\ \nonumber \phi _{\frac{2n-(n-2)}{2}(n-1)-(n-1)+j}(x_{n-1},x_j)&= \mathbf{E}\{y[k]~|~u[k-n+1]=x_{n-1},u[k-j]=x_j\} \\&-\phi _{n-1}(x_{n-1})-\phi _j(x_j)-\phi _0, ~j=n \end{aligned}$$
    (5)

From the theorem, we see that not only can the system (1) be represented by the data-driven basis functions \(\phi _i\)’s as in (4), but these basis functions are also orthogonal and can be estimated separately. If the estimate \(\widehat{y}[k]=\sum _{i=0}^p\phi _i[k]\) is deemed not sufficient and an additional term \(\phi _{p+1}[k]\) is needed, then only the additional term \(\phi _{p+1}[k]\) has to be identified and added to the model. No previously obtained terms \(\phi _i,~i=0,1,...,p\), have to be reestimated.

3 Identification Under Random Inputs

Though the basis functions \(\phi _i\)’s are determined, they depend on the unknown system and have to be identified from the given data set. From Theorem 1, these unknown \(\phi _i\)’s are the expectations or conditional expectations of the output. Now the question is how to calculate these expectations by empirical averages based on the available input–output measurement data set \(\{y[k],u[k]\}_1^N\). In this work, we adopt a fairly simple yet efficient kernel approach which was developed in our previous works [5, 6]. To this end, let \(x_j\) be any point in the interval I in which the input \(u[\cdot ]\) lies, and define

$$ \varphi _j(x_j,k)= |u[k-j]-x_j|. $$

Let \(\delta > \min _k \varphi _j(x_j,k)\) be any positive constant. Let

$$ M_j(x_j)=\{ m_j(1),m_j(2),...,m_j(l_j)\} $$

be a set that contains the integers \(m_j(i)\)’s such that \(m_j(i) \in M_j(x_j) \Leftrightarrow \delta > \varphi _j(x_j, m_j(i))\). \(l_j(x_j)\) is the number of elements in \(M_j(x_j)\), which is the same as the number of \(\varphi _j(x_j,k)\)’s that are smaller than \(\delta \). Define, for each j and \(x_j\),

$$ w_j(x_j,k)= \left\{ \begin{array}{ll} {{\delta -\varphi _j(x_j,k)} \over { l_j\delta -\sum _{i=1}^{l_j} \varphi _j (x_j,m_j(i))}} &{} k \in M_j(x_j) \\ 0 &{} k \not \in M_j(x_j) \end{array} \right. . $$

Obviously, for all k, j and \(x_j\), \(w_j(x_j,k) \ge 0\) and \(\sum _{k=1}^N w_j(x_j,k) = \sum _{i=1}^{l_j} w_j(x_j, m_j(i))=1\). Similarly, for any pair \(1 \le j_1 < j_2 \le n\) and \((x_{j_1},x_{j_2}) \in I^2\), define

$$ \varphi _{j_1j_2}(x_{j_1},x_{j_2},k)= \Vert (u[k-j_1],u[k-j_2])-(x_{j_1}, x_{j_2})\Vert _2. $$

If \(\delta > \min \varphi _{j_1j_2}(x_{j_1},x_{j_2},k)\), let \(M_{j_1j_2}(x_{j_1},x_{j_2})=\{ m_{j_1j_2}(1),m_{j_1j_2}(2),..., m_{j_1j_2}(l_{j_1j_2})\}\) be a set such that \(k \in M_{j_1j_2}(x_{j_1},x_{j_2}) \Leftrightarrow \delta > \varphi _{j_1j_2}(x_{j_1},x_{j_2}, k)\). Define

$$ w_{j_1j_2}(x_{j_1},x_{j_2},k)= \left\{ \begin{array}{ll} {{\delta -\varphi _{j_1j_2}(x_{j_1},x_{j_2},k)} \over {l_{j_1j_2}\delta - \sum _{i=1}^{l_{j_1j_2}} \varphi _{j_1j_2} (x_{j_1},x_{j_2},m_{j_1j_2}(i))}} &{} k \in M_{j_1j_2}(x_{j_1},x_{j_2}) \\ 0&{} k \not \in M_{j_1j_2}(x_{j_1},x_{j_2}) \end{array} \right. . $$

Notice that the same properties hold

$$ w_{j_1j_2}(x_{j_1},x_{j_2},k) \ge 0,~ \sum _{k=1}^Nw_{j_1j_2}(x_{j_1},x_{j_2},k) = \sum _{i=1}^{l_{j_1j_2}}w_{j_1j_2}(x_{j_1},x_{j_2},m_{j_1j_2}(i)) =1. $$

Now, for a given pair \((x_{j_1},x_{j_2}) \in I^2\), we define the estimates \(\widehat{\phi }_i\), \(i=0,1,...,M\),

$$\begin{aligned}&\nonumber \widehat{\phi }_0= {1 \over N} \sum _{k=1}^{N} y[k], \\&\nonumber \widehat{\phi }_j(x_j)= \sum _{k=1}^N w_j(x_j,k)y[k]-\widehat{\phi }_0, ~j=1,...,n, \\&\nonumber \widehat{\phi }_{\frac{2n}{2}-1+j}(x_1,x_j) =\sum _{k=1}^N w_{1j}(x_1,x_j,k)y[k] -\widehat{\phi }_1(x_1)-\widehat{\phi }_j(x_j)-\widehat{\phi }_0,~j=2,...,n \\&\nonumber \widehat{\phi }_{\frac{2n-1}{2}2-2+j}(x_2,x_j) =\sum _{k=1}^N w_{2j}(x_2,x_j,k)y[k] -\widehat{\phi }_2(x_2)-\widehat{\phi }_j(x_j)-\widehat{\phi }_0,~j=3,...,n \\&\qquad \nonumber \ldots \\&\nonumber \widehat{\phi }_{\frac{2n-(n-3)}{2}(n-2)-(n-2)+j}(x_{n-2},x_j)= \sum _{k=1}^N w_{n-2,j}(x_{n-2},x_j,k)y[k] \\&\nonumber \qquad \qquad \qquad \qquad -\widehat{\phi }_{n-2}(x_{n-2})-\widehat{\phi }_j(x_j)-\widehat{\phi }_0, ~j=n-1,n \\&\nonumber \widehat{\phi }_{\frac{2n-(n-2)}{2}(n-1)-(n-1)+j}(x_{n-1},x_j) = \sum _{k=1}^Nw_{n-1,j}(x_{n-1},x_j,k)y[k] \\&\qquad \qquad \qquad \qquad -\widehat{\phi }_{n-1}(x_{n-1})-\widehat{\phi }_j(x_j)-\widehat{\phi }_0, ~j=n \end{aligned}$$
(6)
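
As an illustration of how the estimates in (6) can be computed, the sketch below implements the 1-factor case; the 2-factor estimates follow the same pattern with the Euclidean distance \(\varphi _{j_1j_2}\). The array alignment (lag_u[k] holding \(u[k-j]\) for the same k as y[k]) and the function names are our assumptions, not part of the original derivation.

```python
import numpy as np

def weights_1d(x, lag_u, delta):
    """Weights w_j(x_j, k): nonzero only where |u[k-j] - x_j| < delta."""
    phi = np.abs(lag_u - x)                      # varphi_j(x_j, k)
    w = np.where(phi < delta, delta - phi, 0.0)  # zero outside the delta-window
    if w.sum() == 0:
        raise ValueError("no samples within delta of x; increase delta")
    return w / w.sum()                           # weights sum to one, as required

def phi_hat_0(y):
    """\\hat phi_0 in (6): the sample mean of the output."""
    return np.mean(y)

def phi_hat_1factor(x, lag_u, y, delta):
    """\\hat phi_j(x_j) in (6); lag_u[k] holds u[k-j] aligned with y[k]."""
    w = weights_1d(x, lag_u, delta)
    return np.dot(w, y) - phi_hat_0(y)
```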

Theorem 2

Consider the system (4) and the estimates above. For any \(x_{j_1},x_{j_2} \in I\), assume

  • The unknown basis functions \(\phi _i\)’s are differentiable with the Lipschitz constant L for \(x_{j_1},x_{j_2}\in I\).

  • Let \(\psi (\cdot )\) be the (unknown) probability density function of the input \(u[\cdot ]\) and \(\psi (\cdot )\) is nonzero at \(x_{j_1},x_{j_2}\), i.e.,

    $$ \psi (x_{j_1})> 0,~ \psi (x_{j_2}) > 0. $$
  • \(\delta \rightarrow 0\) and \(\delta ^2N \rightarrow \infty \) as \(N \rightarrow \infty \).

Then, as \(N \rightarrow \infty \), we have in probability that

$$ \widehat{\phi }_0 \rightarrow \phi _0,~~ $$
$$ \widehat{\phi }_j(x_j) \rightarrow \phi _j(x_j), ~j=1,2,...,n $$
$$ \widehat{\phi }_j(x_{j_1},x_{j_2}) \rightarrow \phi _j(x_{j_1},x_{j_2}),~1\le j_1< j_2 \le n,~j=n+1,...,M $$

Moreover asymptotically, \( |\widehat{\phi }_j(x_j) - \phi _j(x_j)|^2 \sim O(\delta +\frac{1}{\delta N}), ~j=1,2,...,n\) and

\(|\widehat{\phi }_j(x_{j_1},x_{j_2}) - \phi _j(x_{j_1},x_{j_2})|^2 \sim O(\delta +\frac{1}{\delta ^2 N}),~1\le j_1{<} j_2 \le n,~ j=n+1,...,M\).

4 Order Determination

How many terms should be included in the model, or equivalently how to determine the order p of the estimate \(\widehat{f}=\sum _{i=0}^p \widehat{\phi }_i[k]\), is an important and difficult part of identification. This amounts to deciding whether the chosen order is sufficient to represent the unknown nonlinear system or whether an additional term or terms should be added to the estimate. A related issue is regressor selection. Even if the order is accurately obtained, some terms \(\phi _i\)’s are irrelevant to the output and should not be included in the estimate. How to find and remove those terms is also important. These two issues are closely related, and we propose two approaches to address them.

4.1 Combined Residual Analysis and Statistical Test

The idea of the statistical test is fairly simple. Suppose the order p is sufficient so that the estimate \(\sum _{i=0}^p \phi _i[k]\) represents the true but unknown system f well. Then, the residual

$$ r[k]=y[k]-\sum _{i=0}^p \phi _i[k] \approx v[k] $$

is almost white. In other words, if the residual is white, nothing more can be squeezed out from the data and thus the order p is sufficient. Let

$$ \mu =\mathbf{E}r[k],~\gamma [j]=\mathbf{E}(r[k]-\mu )(r[k-j]-\mu ), ~ \rho [j]=\gamma [j]/\gamma [0] $$

denote the mean, the lag-j autocovariance and the lag-j correlation coefficient of r[k], respectively. If the residual r[k] is white, it follows that

$$ \gamma [1]=\gamma [2]=...=0,~~\rho [1]=\rho [2]=...=0 $$

In particular, for the system (4), \(r[k]=y[k]-\sum \phi _i[k]\) is a function of \(u[k-1], u[k-2],...,u[k-n]\) and \(r[k-n]=y[k-n]-\sum \phi _i[k-n]\) is a function of \(u[k-n-1],u[k-n-2],...,u[k-2n]\). They are automatically independent. Thus, what we have to check is whether

$$ \rho [1]=\rho [2]=...=\rho [n-1]=0 $$

The best-known tests in the literature for checking whether \(\rho [1]=\rho [2]=...=\rho [n-1]=0\) are the Box–Pierce test [9] and its variants, which have been widely accepted and applied in linear system identification [18, 29]. The test states that, for large N,

$$\begin{aligned} N \sum _{j=1}^{n-1} \rho [j]^2 = N (\rho [1],...,\rho [n-1]) \begin{pmatrix}\rho [1] \\ \vdots \\ \rho [n-1] \end{pmatrix} \end{aligned}$$
(7)

follows a chi-square distribution with (n-1) degrees of freedom if r[k] is white. This provides a framework for statistical hypothesis tests. Let

$$ H_0: \text {the residual } r[k] \text { is white.} $$

Then, the null hypothesis \(H_0\) can be tested based on \(N \sum _{j=1}^{n-1} \rho [j]^2\) and the \(\chi ^2(n-1)\) distribution. If \(H_0\) is accepted, r[k] is considered to be white and the order p is accepted. To test the hypothesis, we calculate \(N \sum _{j=1}^{n-1} \rho [j]^2\) based on the residual. Let the threshold d be taken from the \(\chi ^2(n-1)\) distribution with \(\alpha \) being the level of significance, i.e., the probability of rejecting \(H_0\) when \(H_0\) is true. The hypothesis \(H_0\) is accepted if \(N \sum _{j=1}^{n-1} \rho [j]^2 \le d\); it is rejected if \(N \sum _{j=1}^{n-1} \rho [j]^2 > d\), in which case we conclude that the order p is not high enough.

There are two problems, however. The first is that what we really test is not whether the residual r[k] is white but whether r[i] and r[j] are uncorrelated. The Box–Pierce test (7) works well for this purpose in linear identification but may not work in nonlinear identification. The residual r[k] usually exhibits some nonlinear dependence in nonlinear identification because the actual \(\phi _i\)’s are unavailable and only their estimates \(\widehat{\phi }_i\)’s are known, which unavoidably introduces some nonlinear dependence into the residual. In such a case, the Box–Pierce test does not work well. In fact, the Box–Pierce test could be invalid and lead to misleading conclusions [32]. Therefore, a modified Box–Pierce test is needed in the presence of nonlinear dependence of r[k]. The second problem is that even if the null hypothesis \(H_0\) is accepted, it does not necessarily mean that r[k] is white, since the test only controls the probability of rejecting \(H_0\) given that \(H_0\) is true. There is no way of knowing the probability

$$ \mathrm{Prob}\{\text {accept } H_0~:~H_0 \text { is false}\} $$

This is referred to as the error of the second type and is hard to quantify. Thus, there must be an additional and independent test to make it reasonably sure that \(H_0\) is not false. We deal with these two problems separately.

Modified Box–Pierce test: Let \(r[k]=y[k]-\sum _{i=0}^p \widehat{\phi }_i[k]\) be the residual. Denote the sample mean, the lag-j autocovariance, and the lag-j correlation coefficient by, respectively,

$$ \widehat{\mu } \,{=}\, \frac{1}{N} \sum _{k=1}^N r[k],~\widehat{\gamma }[j]\,{=}\, \frac{1}{N-j}\!\sum _{k=j+1}^N (r[k]-\widehat{\mu })(r[k-j]-\widehat{\mu }),~ \widehat{\rho }\,[j]\,{=}\, \widehat{\gamma }[j]/\widehat{\gamma }[0] $$

It was shown in [19] that for large N,

$$\begin{aligned} N (\widehat{\rho }\,[1],...,\widehat{\rho }\,[n-1]) V^{-1} \begin{pmatrix}\widehat{\rho }\,[1] \\ \vdots \\ \widehat{\rho }\,[n-1]\end{pmatrix} \end{aligned}$$
(8)

follows a chi-square distribution with (n-1) degrees of freedom when \(H_0\) is true, where

$$ V=C/\gamma [0]^2 =\begin{pmatrix} c_{11} &{} \ldots &{} c_{1,n-1} \\ \vdots &{} \ddots &{} \vdots \\ c_{n-1,1} &{} \ldots &{} c_{n-1,n-1}\end{pmatrix} /\gamma [0]^2 $$
$$ c_{ij}=\sum _{q=-\infty }^\infty \mathbf{E}(r[k]-\mu )(r[k-i]-\mu )(r[k+q]-\mu )(r[k+q-j]-\mu ) $$
$$ ~~~~~~~~~~~i,j=1,...,n-1 $$

with \(\mu \) being the mean value of r[k]. The difference is that the identity matrix is used in the Box–Pierce test (7), while in the modified Box–Pierce test (8) the actual autocovariance matrix V is used. The modified Box–Pierce test is reliable for large N even if the residual r[k] exhibits nonlinear dependence. For our application, however, the actual autocovariance matrix V is unknown and has to be estimated. To this end, let

$$ W[k]= \begin{pmatrix}(r[k]-\widehat{\mu })(r[k-1]-\widehat{\mu }) \\ (r[k]-\widehat{\mu })(r[k-2]-\widehat{\mu }) \\ \vdots \\ (r[k]-\widehat{\mu })(r[k-n+1]-\widehat{\mu }) \end{pmatrix} $$

and K(x) be the triangle kernel function

$$ K(x) = \left\{ \begin{array}{ll} 1-|x|, &{} |x| \le 1 \\ 0, &{} |x| >1 \end{array} \right. $$

Now, define the estimate \(\widehat{V}\) of V by \(\widehat{C}/\widehat{\gamma }[0]^2\) with

$$\begin{aligned} \widehat{C}&= \sum _{q=-l}^l K(\frac{q}{l}) \frac{1}{N-n+1-|q|} \sum _k W[k]W[k-q]' \\&=\sum _{q=-l}^0 K(\frac{q}{l}) \frac{1}{N-n+1+q} \sum _{k=n}^{N+q} W[k]W[k-q]' \\&+\sum _{q=1}^l K(\frac{q}{l}) \frac{1}{N-n+1-q} \sum _{k=n+q}^N W[k]W[k-q]' \end{aligned}$$

where l is the bandwidth of the kernel \(K(\cdot )\). Note that all the variables \(\widehat{\mu }\), \(\widehat{\rho }\,[j]\), W[k] and \(\widehat{\gamma }[j]\) are computable. We now show that the modified Box–Pierce test is still valid if the actual autocovariance matrix V is replaced by its estimate as discussed above.

Theorem 3

Consider the residual r[k] and the corresponding \(\widehat{\mu }\), \(\widehat{\gamma }[j]\), \(\widehat{\rho }\,[j]\) and \(\widehat{V}=\widehat{C}/\widehat{\gamma }[0]^2\). Then,

$$\begin{aligned} Q_{n-1} = N (\widehat{\rho }\,[1],...,\widehat{\rho }\,[n-1]) \widehat{V}^{-1} \begin{pmatrix}\widehat{\rho }\,[1] \\ \vdots \\ \widehat{\rho }\,[n-1]\end{pmatrix} \end{aligned}$$
(9)

converges, in distribution as \(N \rightarrow \infty \), to a chi-square distribution with (n-1) degrees of freedom if the residual r[k] is white, provided that

$$ l \rightarrow \infty ,~l/N \rightarrow 0,~as~N \rightarrow \infty $$
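
A minimal numerical sketch of the statistic \(Q_{n-1}\) in (9), with \(\widehat{C}\) formed by the triangle-kernel smoothing above, might look as follows; the helper name and the use of scipy for the \(\chi ^2\) threshold are our choices for illustration only.

```python
import numpy as np
from scipy import stats

def modified_box_pierce(r, n, l, alpha=0.05):
    """Q_{n-1} of (9) for residual r; returns (Q, reject H_0 at level alpha)."""
    N = len(r)
    rc = r - np.mean(r)
    gamma0 = np.mean(rc ** 2)
    # sample autocorrelations rho[1..n-1]
    rho = np.array([np.mean(rc[j:] * rc[:-j]) for j in range(1, n)]) / gamma0

    # W[k] stacks the lag products (r[k]-mu)(r[k-j]-mu), j = 1..n-1, for k = n..N
    W = np.array([[rc[k] * rc[k - j] for j in range(1, n)] for k in range(n - 1, N)])
    K = W.shape[0]                               # number of W[k]'s, i.e. N - n + 1

    # estimate C with the triangle kernel K(q/l)
    C = np.zeros((n - 1, n - 1))
    for q in range(-l, l + 1):
        kern = 1.0 - abs(q) / l
        if kern <= 0.0:
            continue
        A, B = (W[q:], W[:K - q]) if q >= 0 else (W[:K + q], W[-q:])
        C += kern * (A.T @ B) / (K - abs(q))     # sum over k of W[k] W[k-q]'

    V_hat = C / gamma0 ** 2
    Q = N * rho @ np.linalg.solve(V_hat, rho)
    return Q, Q > stats.chi2.ppf(1 - alpha, n - 1)
```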

Residual analysis: As discussed above, the hypothesis test is effective only if it is reasonably sure that \(H_0\) is not false. A very simple but common-sense check is to examine the magnitude of the residual. This serves two purposes: if the estimate represents the system well or the order is adequate, the residual should be small; on the other hand, we do not want to over-fit the system. In this regard, the parsimony principle applies. Let \(r_p[k] =y[k]-\sum _{i=0}^p \widehat{\phi }_i[k]\) be the residual, where the subscript p indicates the order of the estimate. Define the average error

$$ e[p]=\frac{1}{N} \sum _{k=1}^N r_p[k]^2 $$

Obviously, the average error e[p] is a monotonically decreasing function of the order p, as depicted in the top diagram of Fig. 4. Initially, e[p] decreases as the order increases because the model picks up relevant terms \(\phi _i\)’s of the unknown system. However, even when the correct order has been reached, the value e[p] still decreases because additionally added terms try to model the noise. The improved “fit” is harmful since it models the noise but not the system. However, the decrease from over-fitting is less significant than the decrease when the relevant terms are picked up by the estimate. Therefore, what we are looking for is where the curve e[p] is small and flattens out, known as the “knee” in Fig. 4.

We are now in a position to state the combined residual analysis and hypothesis test approach for order determination.

Step 1: Carry out identification by estimating \(\widehat{\phi }_i\) as described in the previous sections.

Step 2: Calculate the residual \(r_p[k]\) for each p and plot the average error e[p] vs p as shown in the top diagram of Fig. 4.

Step 3: Find the knee in the curve where the average error e[p] is small and flattened. Determine the corresponding order p for the hypothesis test.

Step 4: Calculate \(Q_{n-1}\) as in (9) and carry out the modified Box–Pierce test. Let the threshold d be taken from the \(\chi ^2(n-1)\) distribution with \(\alpha \) being the level of significance, usually 0.03–0.05, i.e., the probability of rejecting \(H_0\) when \(H_0\) is true. The hypothesis \(H_0\) is accepted if \(Q_{n-1} \le d\), and we conclude that the order p is sufficient. The hypothesis \(H_0\) is rejected if \(Q_{n-1}> d\), and we conclude that the order p is not high enough and an additional term or terms should be included in the estimate. The test is then repeated with \(p\rightarrow p+1\).
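
The four steps can be automated along the following lines, reusing the modified_box_pierce sketch given after Theorem 3; here phi_hat_terms is assumed to be a precomputed array with \(\widehat{\phi }_i[k]\) in row i, and in practice the knee of e[p] would be inspected before running the test.

```python
import numpy as np

def determine_order(y, phi_hat_terms, n, l, alpha=0.05):
    """Scan p, record e[p], and accept the first p whose residual passes the Q test."""
    errors = []
    for p in range(phi_hat_terms.shape[0]):
        r = y - phi_hat_terms[: p + 1].sum(axis=0)   # residual r_p[k]
        errors.append(np.mean(r ** 2))               # average error e[p]
        Q, reject = modified_box_pierce(r, n, l, alpha)
        if not reject:                               # H_0 accepted: order p sufficient
            return p, errors
    return phi_hat_terms.shape[0] - 1, errors
```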

4.2 Relative and Cumulative Contribution Approach

In order determination, what we are interested in is not whether a particular term \(\phi _i[k]\) contributes, but whether its contribution is significant. Identification is always a balance between model accuracy and model parsimony. The data-driven orthogonal approach discussed in the previous sections allows us to decompose the total contribution into a sum of individual contributions, referred to as relative contributions in this work, and this provides a reliable way to carry out order determination and regressor selection. To this end, we propose a relative contribution approach that exploits the orthogonal properties of the basis functions. Consider the system (4). It is easily verified from the orthogonal properties of the \(\phi _i[k]\)’s that

$$ \mathbf{E}y[k]^2 =\mathbf{E}\{ \sum _{i=0}^M \phi _i[k]+v[k]\}^2 =\sum _{i=0}^M \mathbf{E}\phi _i[k]^2 +\mathbf{E}v[k]^2 $$

We now define the relative contribution \(R_c[j]\) as

$$ R_c[j]= {{\mathbf{E}\phi _j[k]^2} \over { \mathbf{E}y[k]^2}},~j=0,...,M $$

Since the squared term is proportional to energy, the relative contribution \(R_c[p]\) is the fraction of the overall output energy contributed by the pth term. Obviously, if the pth term is insignificant, its relative contribution \(R_c[p]\) should be small and the term should not be a part of the estimate.

A closely related concept is the cumulative contribution \(C_c[p]\)

$$ C_c[p]= \sum _{j=0}^p R_c(j)= \sum _{j=0}^p {{\mathbf{E}\phi _j[k]^2} \over { \mathbf{E}y[k]^2}},~ p=0,...,M $$

which measures the contribution of the first \(p+1\) terms relative to the overall output. Obviously, if the order p is correct, the cumulative contribution \(C_c[p]\) should be close to unity and the curve of \(C_c[p]\) vs p flattens out. It is important to point out that, because of the noise contribution term \(\mathbf{E}v[k]^2\), the cumulative contribution can never reach 100%. To test the order based on the cumulative contribution, the relative contribution of the unknown noise has to be estimated. This makes the method based on the cumulative contribution less efficient than the relative contribution approach.

In reality, the \(\phi _i\)’s are unavailable and only their estimates \(\widehat{\phi }_i\)’s are available. However, because of the convergence property \( \widehat{\phi }_i \rightarrow \phi _i\) as \(N \rightarrow \infty \), we may define the estimates of \(R_c[j]\) and \(C_c[p]\) by

$$ \widehat{R}_c[j]= {{ \frac{1}{N} \sum _{k=1}^N\widehat{\phi }_j[k]^2 } \over { \frac{1}{N} \sum _{k=1}^N y[k]^2}} $$

and

$$ \widehat{C}_c[p]= \sum _{j=0}^p {{ \frac{1}{N} \sum _{k=1}^N\widehat{\phi }_j[k]^2 } \over { \frac{1}{N} \sum _{k=1}^N y[k]^2}}. $$

The substitution is reliable for large N because of the convergence property.

To test whether the pth term should be included, we compute \(\widehat{R}_c[p]\) and choose a threshold \(d_1\), for example \(d_1=0.03\) or 3%. If \(\widehat{R}_c[p] \ge d_1\), the pth term is included. Otherwise the term is discarded. This not only provides the order of the system but also determines exactly which term should be included in the model.
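
A sketch of the resulting regressor-selection rule (the array layout and names are ours): compute \(\widehat{R}_c[j]\) for every term and keep a term only if it exceeds the threshold \(d_1\).

```python
import numpy as np

def select_terms(phi_hat_terms, y, d1=0.03):
    """Relative/cumulative contributions and the d1-threshold rule.

    phi_hat_terms: array of shape (M+1, N) whose row j holds \\hat phi_j[k] on the data.
    Returns (R_hat, C_hat, keep) with keep[j] True iff term j is retained.
    """
    energy_y = np.mean(y ** 2)                              # (1/N) sum of y[k]^2
    R_hat = np.mean(phi_hat_terms ** 2, axis=1) / energy_y  # \hat R_c[j]
    C_hat = np.cumsum(R_hat)                                # \hat C_c[p]
    keep = R_hat >= d1
    return R_hat, C_hat, keep
```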

5 Deterministic Inputs and Galois Sequence

Generally, there are two ways to estimate the structure of the system. The first one is full scale system identification. The idea is to identify the system including each \(\bar{f}_i\) and \(\bar{f}_{ij}\) and then enumerate all possible models for different combinations of \(\bar{f}_i\) and \(\bar{f}_{ij}\) as well as n. Some performance measures are calculated and the model that achieves the best performance is chosen. Then, the corresponding n is the estimate of time lag and the surviving terms of \(\bar{f}_i\) and \(\bar{f}_{ij}\) are retained in the system. All other \(\bar{f}_i\)’s and \(\bar{f}_{ij}\)’s are considered to be negligible. The method does not distinguish between model structural estimation and full scale system identification. Note that the system is nonparametric and nonlinear. Hence, identification is usually computationally expensive and the optimization algorithm could be stuck in a local minimum. It is certainly advantageous if the structure of the system can be estimated before a full scale system identification is performed. To this end, we propose two different methods.

5.1 Visual Inspection Method

Recall that in structural estimation, we are interested not in full scale system identification, but rather in finding a simple and reliable way to estimate the structure, in particular to determine the terms \(\bar{f}_i\) and \(\bar{f}_{ij}\) which contribute significantly. In this section, we assume that the input is at our disposal (which admittedly may be restrictive in some applications). Under such an assumption, the first problem is to find an input sequence that is simple and has the ability to separate the contributions of \(\bar{f}_i\) and \(\bar{f}_{ij}\).

$$\begin{aligned} U_{2^3}= \begin{pmatrix} u(1) &{} u(0) &{} u(-1) \\ u(2) &{} u(1) &{} u(0) \\ u(3) &{} u(2) &{} u(1) \\ u(4) &{} u(3) &{} u(2) \\ u(5) &{} u(4) &{} u(3) \\ u(6) &{} u(5) &{} u(4) \\ u(7) &{} u(6) &{} u(5) \\ u(8) &{} u(7) &{} u(6) \\ u(9) &{} u(8) &{} u(7) \end{pmatrix} =\begin{pmatrix} a_1 &{} a_1 &{} a_1 \\ a_2 &{} a_1 &{} a_1 \\ a_2 &{} a_2 &{} a_1 \\ a_2 &{} a_2 &{} a_2 \\ a_1 &{} a_2 &{} a_2 \\ a_2 &{} a_1 &{} a_2 \\ a_1 &{} a_2 &{} a_1 \\ a_1 &{} a_1 &{} a_2 \\ a_1 &{} a_1 &{} a_1 \end{pmatrix}. \end{aligned}$$
(10)

To this end, let l be a prime number that indicates the number of levels of the input, i.e., \(u[k]\in \{a_1,a_2,...,a_l\}\), usually with \(|a_i| \not = |a_j|\) for \(i\ne j\) to avoid ambiguity for quadratic nonlinearities. To excite the system to the maximum extent, the input sequence should contain all possible combinations of the n-tuple \((a_{i_1},a_{i_2},\ldots ,a_{i_n})\), \(a_{i_j}\in \{a_1,\ldots ,a_l\}\). The minimum length of such a generating sequence is \(n+l^n-1\). The Galois sequence is such a sequence, which has been investigated in [13, 20] for worst-case identification. The Galois sequence has many desirable properties. It is a periodic pseudorandom sequence with period \(l^n\) [20] and can be easily generated [13]. More importantly, within one period, it produces each n-tuple combination exactly once [20]. Note that the Galois sequence defined here is slightly different from the traditional one [13] as we need all the n-tuples to be included. This small difference can be easily taken care of and in fact this definition is exactly the same as in [20]. An example of \(G(l^n)\) for \(n=3\) and \(l=2\) is given in (10). To average out the effect of noise, the input sequence is repeated L times, i.e.,

$$\begin{aligned} U_{Ll^n}=\left. \begin{array}{c} \begin{pmatrix} U_{l^n} \\ U_{l^n} \\ \vdots \\ U_{l^n}\end{pmatrix} \end{array} \right\} {L} \text{ times }. \end{aligned}$$
(11)
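
The Galois sequence itself is constructed from Galois-field techniques [13, 20]; purely as an illustration of the covering property used here (every n-tuple appears exactly once per period of \(l^n\)), the sketch below generates a de Bruijn sequence with the same property via the standard FKM algorithm and repeats it L times. This is a stand-in for the construction in [13, 20], not a reproduction of it.

```python
def de_bruijn(l, n):
    """Cyclic sequence of length l**n over {0,...,l-1} containing every n-tuple once."""
    a = [0] * (l * n)
    seq = []
    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, l):
                a[t] = j
                db(t + 1, t)
    db(1, 1)
    return seq

def covering_input(levels, n, L):
    """Map symbols to the levels a_1,...,a_l and repeat the period L times."""
    period = [levels[s] for s in de_bruijn(len(levels), n)]
    # prepend the last n-1 samples so every n-tuple also appears in the linear record
    return period[-(n - 1):] + period * L

# e.g. covering_input([-1.0, 2.0], 3, L) gives a two-level sequence for the n = 3 case of (10)
```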

Before performing structural estimation, it is interesting to observe that the representation (1) of the system is actually not unique. For instance, \(\bar{f}_1 \rightarrow \bar{f}_1+c\) and \(\bar{f}_2 \rightarrow \bar{f}_2-c\) for any constant c would not change the input–output relationship which implies that the structure of the system, as represented in (1), is not identifiable. To overcome this problem, we normalize the system to make the averages of \(\bar{f}_i\) and \(\bar{f}_{ij}\) with respect to the input equal to zero. Let

$$ g_{j,ij}(u[k-j])={1 \over l} \sum _{m=1}^l \bar{f}_{ij}(a_m,u[k-j]),~ $$
$$ g_{i,ij}(u[k-i])={1 \over l} \sum _{m=1}^l \bar{f}_{ij}(u[k-i],a_m) $$

be the partial average of \(\bar{f}_{ij}\) with respect to the first and second variables respectively and

$$ \check{c}_{ij}={1 \over l^2} \sum _{m_1=1}^l \sum _{m_2=1}^l \bar{f}_{ij}(a_{m_1},a_{m_2}) $$

be the total average. Define

$$\begin{aligned} \check{f}_{ij}(u[k-i],u[k-j]) = \bar{f}_{ij}(u[k-i],u[k-j]) - g_{j,ij}(u[k-j]) - g_{i,ij}(u[k-i]) +\check{c}_{ij}. \end{aligned}$$
(12)

Obviously, the average of this new function is zero,

$$\begin{aligned} \sum _{m=1}^l \check{f}_{ij}(a_m,u[k-j])= \sum _{m=1}^l \check{f}_{ij}(u[k-i],a_m)=0. \end{aligned}$$
(13)

To make the average of \(\bar{f}_i\) equal to zero, let, for each \(1 \le i\le n\),

$$\begin{aligned} \nonumber&\check{f}_1(u[k-1]) = \bar{f}_1(u[k-1])+\sum _{i=2}^ng_{1,1i}(u[k-1]) -\underbrace{{1 \over l} \sum _{m=1}^l [\bar{f}_1(a_m)+ \sum _{i=2}^ng_{1,1i}(a_m)]}_{\check{c}_1}, \\ \nonumber&\check{f}_{n-1}(u[k-n+1]) = \bar{f}_{n-1}(u[k-n+1]) +\sum _{i=1}^{n-2}g_{(n-1),i(n-1)}( u[k-n+1]) \\ \nonumber&{\small +g_{(n-1),(n-1)n}(u[k-n+1]) -\underbrace{{1 \over l} \sum _{m=1}^l [\bar{f}_{n-1}(a_m)+ \sum _{i=1}^{n-2}g_{(n-1),i(n-1)}(a_m)+ g_{(n-1),(n-1)n}(a_m)]}_{\check{c}_{n-1}}}, \\&\check{f}_n(u[k-n])=\bar{f}_n(u[k-n])+\sum _{i=1}^{n-1}g_{n,in}(u[k-n]) -\underbrace{{1 \over l} \sum _{m=1}^l [\bar{f}_n(a_m)+ \sum _{i=1}^{n-1}g_{n,in}(a_m)]}_{\check{c}_n}. \end{aligned}$$
(14)

Since,

$$\begin{aligned} \sum _{m=1}^l \check{f}_i(a_m)=0,~\forall i \end{aligned}$$
(15)

by taking \(\check{c}=\bar{c}-\sum _{1\le i< j \le n} \check{c}_{ij}+\sum _{i=1}^n \check{c}_i\), it follows that the system (1) can be rewritten as

$$\begin{aligned} y[k]=\check{c}+\sum _{i=1}^n\check{f}_i(u[k-i])+ \sum _{1\le i<j \le n} \check{f}_{ij}(u[k-i], u[k-j]) +v[k],~k=1,2,\ldots ,Ll^n. \end{aligned}$$
(16)

This makes the representation unique. For each \(1\le i <j \le n\), \(m_i, m_j=1,\ldots ,l\) and \(s=1,2,\ldots ,L\), define the partial averages of the output,

$$ Z^{ij}_{m_im_js} = {1 \over {l^{n-2}}} \sum _{\substack{k=1 \\ u[k-i]=a_{m_i},\,u[k-j]= a_{m_j}}}^{l^n} y[(s-1)l^n+k] $$
$$\begin{aligned} \begin{array}{lll} Z^{ij}_{m_im_j\cdot } &{}=&{} {1 \over L} \sum \nolimits _{s=1}^L Z^{ij}_{m_im_js} \\ Z^{ij}_{m_i\cdot \cdot } &{}=&{} {1 \over l} \sum \nolimits _{m_j=1}^l Z^{ij}_{m_im_j\cdot } \\ Z^{ij}_{\cdot m_j\cdot } &{}=&{} {1 \over l} \sum \nolimits _{m_i=1}^l Z^{ij}_{m_im_j\cdot } \\ Z^{ij}_{\cdot \cdot \cdot } &{}=&{} {1 \over l} \sum \nolimits _{m_i=1}^l Z^{ij}_{m_i\cdot \cdot } ={1 \over l} \sum _{m_j=1}^l Z^{ij}_{\cdot m_j\cdot } \end{array} \end{aligned}$$
(17)

The subscript “dot” indicates that the average has been taken with respect to that variable; e.g., \(Z^{ij}_{m_im_j \cdot }\) is the average of \(Z^{ij}_{m_im_j s}\) with respect to the last variable s.
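
For completeness, a direct (unoptimized) computation of the partial averages in (17) could read as follows; the 0-based array alignment, whereby u_lag_i[t] holds \(u[t-i]\) for the same t as y[t], is our assumption. The visual-inspection estimates \(\tilde{f}_i\) and \(\tilde{f}_{ij}\) introduced below are then simple differences of these arrays.

```python
import numpy as np

def partial_averages(y, u_lag_i, u_lag_j, levels, n, L):
    """Z^{ij}_{m_i m_j s} and its averages Z_{m_i m_j .}, Z_{m_i ..}, Z_{. m_j .}, Z_{...} of (17).

    y, u_lag_i, u_lag_j: arrays of length L*l**n aligned so that u_lag_i[t] = u[t-i].
    """
    l, P = len(levels), len(levels) ** n
    Z = np.zeros((l, l, L))
    for s in range(L):
        for mi, ai in enumerate(levels):
            for mj, aj in enumerate(levels):
                t = np.arange(s * P, (s + 1) * P)
                mask = (u_lag_i[t] == ai) & (u_lag_j[t] == aj)
                Z[mi, mj, s] = y[t][mask].mean()   # averages l**(n-2) output samples
    Zmm = Z.mean(axis=2)     # Z^{ij}_{m_i m_j .}
    Zi = Zmm.mean(axis=1)    # Z^{ij}_{m_i . .}
    Zj = Zmm.mean(axis=0)    # Z^{ij}_{. m_j .}
    Zall = Zmm.mean()        # Z^{ij}_{. . .}
    return Z, Zmm, Zi, Zj, Zall
```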

To provide a physical interpretation of the above variables, let us focus on the system (16) with \(n=3\), \(l=2\) and the Galois sequence \(GF(2^3)\) as in (10) and (11). Within one period, it is clear that for any fixed column of \(U_{2^3}\), half of the entries have values at \(a_1\) and the other half are at \(a_2\). Further, it is straightforward using (13) and (15) to show that for \(i=1\) and \(j=2\),

$$\begin{aligned} Z^{12}_{11s}&=\check{c}+\check{f}_1(a_1)+\check{f}_2(a_1)+ \check{f}_{12}(a_1,a_1) +(v[(s-1)2^3+1]+v[(s-1)2^3+8])/2, \\ Z^{12}_{12s}&= \check{c}+\check{f}_1(a_1)+\check{f}_2(a_2)+ \check{f}_{12}(a_1,a_2) +(v[(s-1)2^3+5]+v[(s-1)2^3+7])/2, \\ Z^{12}_{21s}&=\check{c}+\check{f}_1(a_2)+\check{f}_2(a_1)+ \check{f}_{12}(a_2,a_1) +(v[(s-1)2^3+2]+v[(s-1)2^3+6])/2, \\ Z^{12}_{22s}&=\check{c}+\check{f}_1(a_2)+\check{f}_2(a_2)+ \check{f}_{12}(a_2,a_2) +(v[(s-1)2^3+3] +v[(s-1)2^3+4])/2. \end{aligned}$$

Moreover,

$$\begin{aligned} Z^{12}_{11\cdot }&= \check{c}+\check{f}_1(a_1)+\check{f}_2(a_1)+ \check{f}_{12}(a_1,a_1) + {1 \over L} \sum _{s=1}^L(v[(s-1)2^3+1] +v[(s-1)2^3+8])/2, \\ Z^{12}_{12\cdot }&=\check{c}+\check{f}_1(a_1)+\check{f}_2(a_2)+ \check{f}_{12}(a_1,a_2) + {1 \over L} \sum _{s=1}^L(v[(s-1)2^3+5] +v[(s-1)2^3+7])/2, \\ Z^{12}_{1\cdot \cdot }&= \check{c}+\check{f}_1(a_1) +{1 \over {2L}} \sum _{s=1}^L\{(v[(s-1)2^3+5] + v[(s-1)2^3+7])/2 \\&\quad +(v[(s-1)2^3+1] +v[(s-1)2^3+8])/2\}, \\ Z^{12}_{\cdot \cdot \cdot }&=\check{c}+ {1 \over {8L}} \sum _{k=1}^{L2^3}v[k]. \end{aligned}$$

Clearly, an estimate of \(\check{c}\) is obtained from \(Z^{12}_{\cdot \cdot \cdot }\) and an estimate of \(\check{f}_1(a_1)\) is obtained from \(Z^{12}_{1\cdot \cdot } -Z^{12}_{\cdot \cdot \cdot }\). The results can be trivially but cumbersomely extended to the system (16) with any \(n \ge 2\), \(l \ge 2\) and i, j, as summarized in the following theorem.

Theorem 4

Consider the system (16) for any \(n \ge 2\), \(l\ge 2\) with the Galois input as in (11) and the variables defined in (17). Then, for any \(1\le i < j \le n\) and \(m_i,m_j=1,\ldots ,l\), we have

$$ Z^{ij}_{m_im_js}=\check{c}+\check{f}_i(a_{m_i})+\check{f}_j(a_{m_j})+ \check{f}_{ij}(a_{m_i},a_{m_j}) +\varepsilon ^{ij}_{m_im_js} $$

where \(\varepsilon ^{ij}_{m_im_js}\)’s are iid with zero mean and variance \(\sigma ^2/{l^{n-2}}\) and

$$\begin{aligned} Z^{ij}_{m_im_j\cdot }&=\check{c}+\check{f}_i(a_{m_i})+\check{f}_j(a_{m_j})+ \check{f}_{ij}(a_{m_i},a_{m_j}) +\frac{1}{L} \sum _{s=1}^L \varepsilon ^{ij}_{m_im_js}, \\ Z^{ij}_{m_i\cdot \cdot }&=\check{c}+\check{f}_i(a_{m_i})+ \frac{1}{lL} \sum _{m_j=1}^l\sum _{s=1}^L \varepsilon ^{ij}_{m_im_js}, \\ Z^{ij}_{\cdot m_j\cdot }&=\check{c}+\check{f}_j(a_{m_j})+ \frac{1}{lL} \sum _{m_i=1}^l\sum _{s=1}^L \varepsilon ^{ij}_{m_im_js}, \\ Z^{ij}_{\cdot \cdot \cdot }&= \check{c}+\frac{1}{l^nL} \sum _{k=1}^{Ll^n}v[k]. \end{aligned}$$

Therefore, for a large L, very good estimates of \(\check{c}\), \(\check{f}_i\), and \(\check{f}_{ij}\) are available from \(Z^{ij}_{m_im_j\cdot }\), \(Z^{ij}_{m_i\cdot \cdot }\), \(Z^{ij}_{\cdot m_j\cdot }\), and \(Z^{ij}_{\cdot \cdot \cdot }\) that are computable from the input–output measurements. The implication of the above result is that the graph of \(\check{f}_i(a_{m_i})\) (\(\check{f}_j(a_{m_j})\)) versus \(a_{m_i}\) (\(a_{m_j}\)) is obtained by the graph of its estimate

$$\begin{aligned} \tilde{f}_i(a_{m_i})&=Z^{ij}_{m_i\cdot \cdot }-Z^{ij}_{\cdot \cdot \cdot } ~~\mathrm{vs}~~a_{m_i}~~ { \mathrm or}~~ \\ \tilde{f}_j(a_{m_j})&=Z^{ij}_{\cdot m_j\cdot }-Z^{ij}_{\cdot \cdot \cdot } ~~\mathrm{vs}~~a_{m_j} \end{aligned}$$

and the graph of \(\check{f}_{ij}(a_{m_i},a_{m_j})\) versus \((a_{m_i},a_{m_j})\) is obtained by \(\tilde{f}_{ij}(a_{m_i},a_{m_j})=(Z^{ij}_{m_im_j\cdot }-Z^{ij}_{m_i\cdot \cdot } -Z^{ij}_{\cdot m_j\cdot }+Z^{ij}_{\cdot \cdot \cdot })\) and

$$ \tilde{f}_{ij}(a_{m_i},a_{m_j}) ~~\mathrm{vs}~~(a_{m_i},a_{m_j}). $$

Accordingly, the contribution of \(\check{f}_i(a_{m_i})\) and \(\check{f}_{ij} (a_{m_i},a_{m_j})\) can be visually inspected from the graphs of \(\tilde{f}_i(a_{m_i})\) and \(\tilde{f}_{ij}(a_{m_i},a_{m_j})\). We make the following comments.

  • Structural estimation is similar to model validation in identification. One can never validate a model unless all possible inputs have been applied. This is clearly impossible in practice. In structural estimation, one can only say that the contribution of \(\check{f}_i(a_{m_i})\) or \(\check{f}_{ij}(a_{m_i},a_{m_j})\) is negligible with respect to the applied input. Therefore, the values \(a_1,\ldots ,a_l\) are important and have to be chosen judiciously.

  • In general, increasing the level l excites the system at more points and this is quite useful for nonlinear system identification. However, there is a balance between the number of levels l and the complexity of the implementation. For \(l=2\) or any binary input, the minimum length of the sequence to cover all possible n-tuple combinations is \(2^n\) and for an l level input, the minimum length becomes \(l^n\). Thus, the complexity increases quickly as l gets larger.

  • In general, a visual inspection works only for terms with at most two factors.

5.2 Analysis of Variance (ANOVA)

The visual inspection approach discussed above is intuitive and efficient but ad hoc. If an estimate \(\tilde{f}_i\) is nonzero but small, it is hard to determine whether the term should be retained or discarded because of noise. To make the idea mathematically rigorous, in this section we develop a statistical hypothesis test based on the well-known analysis of variance (ANOVA) and F-distribution tests. To this end, we make an assumption.

Assumption 5.1

The noise \(v[\cdot ]\) is iid Gaussian with zero mean and variance \(\sigma ^2\).

The Gaussian assumption is needed for the mathematical derivation. However, it has been well documented in the literature [17] that ANOVA is quite robust against violation of the Gaussian assumption. Consider the system (16), the input (11), and the variables (17). Let, for each \(1\le i <j\le n\),

$$\begin{aligned} \begin{array}{lll} SS^{ij}_T &{}=&{} \sum \nolimits _{m_i=1}^l \sum \nolimits _{m_j=1}^l\sum \nolimits _{s=1}^L (Z^{ij}_{m_im_js} -Z^{ij}_{\cdot \cdot \cdot })^2 \\ SS^{ij}_{m_i\cdot } &{}=&{} \sum \nolimits _{m_i=1}^l lL (Z^{ij}_{m_i\cdot \cdot } -Z^{ij}_{\cdot \cdot \cdot })^2 \\ SS^{ij}_{\cdot m_j} &{}=&{} \sum \nolimits _{m_j=1}^l lL (Z^{ij}_{\cdot m_j\cdot } -Z^{ij}_{\cdot \cdot \cdot })^2 \\ SS^{ij}_{\cdot \cdot } &{}=&{} \sum \nolimits _{m_i=1}^l \sum \nolimits _{m_j=1}^lL (Z^{ij}_{m_im_j\cdot } -Z^{ij}_{\cdot m_j\cdot }-Z^{ij}_{m_i\cdot \cdot }+Z^{ij}_{\cdot \cdot \cdot })^2 \\ SS^{ij}_E &{}=&{} \sum \nolimits _{m_i=1}^l \sum \nolimits _{m_j=1}^l\sum \nolimits _{s=1}^L (Z^{ij}_{m_im_js} -Z^{ij}_{m_im_j\cdot })^2. \end{array} \end{aligned}$$
(18)

The following theorem can be shown by some algebraic manipulations and Cochran's theorem [24].

Theorem 5

Consider the variables defined in (18). Then,

  • \(SS^{ij}_T=SS^{ij}_{m_i\cdot }+SS^{ij}_{\cdot m_j}+SS^{ij}_{ \cdot \cdot }+SS^{ij}_E\).

  • \(SS^{ij}_{m_i\cdot }\), \(SS^{ij}_{\cdot m_j}\), \(SS^{ij}_{\cdot \cdot }\), and \(SS^{ij}_E\) are statistically independent.

  • \({{l^{n-2}} \over \sigma ^2} SS^{ij}_E ~\sim \chi ^2(l^2(L-1))\) is \(\chi ^2\) distributed with \(l^2(L-1)\) degrees of freedom.

  • If \(\check{f}_{ij}(a_{m_i},a_{m_j})=0\) for all \(m_i,m_j=1,\ldots ,l\), then

    $$ {{l^{n-2}} \over \sigma ^2} SS^{ij}_{\cdot \cdot }~\sim \chi ^2((l-1)^2). $$
  • If \(\check{f}_i(a_{m_i})=0\) for all \(m_i=1,\ldots ,l\), then

    $$ {{l^{n-2}} \over \sigma ^2} SS^{ij}_{m_i\cdot }~\sim \chi ^2(l-1). $$
  • If \(\check{f}_j(a_{m_j})=0\) for all \(m_j=1,\ldots ,l\), then

    $$ {{l^{n-2}} \over \sigma ^2} SS^{ij}_{\cdot m_j}~\sim \chi ^2(l-1). $$

This theorem sets the foundation for the test of three null hypotheses,

  • \(H_{0ij}:~~\check{f}_{ij}(a_{m_i},a_{m_j})=0,~\forall m_i,m_j =1,\ldots ,l\),

  • \(H_{0i\cdot }:~~\check{f}_i(a_{m_i})=0,~\forall m_i=1,\ldots ,l\),

  • \(H_{0\cdot j}:~~\check{f}_j(a_{m_j})=0,~\forall m_j=1,\ldots ,l\),

by the F-test because if \(H_{0ij}\) is true then

$$ T^{ij} = {{SS^{ij}_{\cdot \cdot }/(l-1)^2} \over {SS^{ij}_E/(l^2(L-1))}}~\sim F((l-1)^2,l^2(L-1)), \text{ for } \text{ all }~ 1\le i<j\le n, $$

is F-distributed with \((l-1)^2\) and \(l^2(L-1)\) degrees of freedom. Similarly, if \(H_{0i\cdot }\) is true,

$$ T^1 = {{SS^{12}_{m_i\cdot }/(l-1)} \over {SS^{12}_E/(l^2(L-1))}}~\sim F(l-1,l^2(L-1)) $$

and if \(H_{0\cdot j }\) is true, \(\forall j=2,\ldots ,n\),

$$ T^j = {{SS^{1j}_{\cdot m_j}/(l-1)} \over {SS^{1j}_E/(l^2(L-1))}}~\sim F(l-1,l^2(L-1)). $$

The null hypothesis \(H_{0ij}\) is rejected if \(T^{ij} > F_\alpha ((l-1)^2,l^2(L-1))\) where \(\alpha \) denotes the level of significance, usually in the range \(0.01-0.1\). The tests for \(H_{0i\cdot }\) and \(H_{0\cdot j}\) are similar. The results from the hypothesis tests are used to determine which \(f_i\) or \(f_{ij}\) should be retained with a certain confidence in probability.
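
A compact sketch of the resulting tests, starting from the array Z[m_i, m_j, s] of partial averages (the helper name and the use of scipy for the F quantiles are ours):

```python
import numpy as np
from scipy import stats

def anova_f_tests(Z, alpha=0.05):
    """Sums of squares (18) and the F-tests for H_{0i.}, H_{0.j}, H_{0ij}."""
    l, _, L = Z.shape
    Zmm = Z.mean(axis=2)                      # Z_{m_i m_j .}
    Zi, Zj, Zall = Zmm.mean(axis=1), Zmm.mean(axis=0), Zmm.mean()

    SS_i = l * L * np.sum((Zi - Zall) ** 2)                             # SS^{ij}_{m_i .}
    SS_j = l * L * np.sum((Zj - Zall) ** 2)                             # SS^{ij}_{. m_j}
    SS_ij = L * np.sum((Zmm - Zi[:, None] - Zj[None, :] + Zall) ** 2)   # SS^{ij}_{..}
    SS_E = np.sum((Z - Zmm[:, :, None]) ** 2)                           # SS^{ij}_E

    dfE = l * l * (L - 1)
    T_i = (SS_i / (l - 1)) / (SS_E / dfE)
    T_j = (SS_j / (l - 1)) / (SS_E / dfE)
    T_ij = (SS_ij / (l - 1) ** 2) / (SS_E / dfE)

    reject_i = T_i > stats.f.ppf(1 - alpha, l - 1, dfE)                 # keep f_i if True
    reject_j = T_j > stats.f.ppf(1 - alpha, l - 1, dfE)                 # keep f_j if True
    reject_ij = T_ij > stats.f.ppf(1 - alpha, (l - 1) ** 2, dfE)        # keep f_ij if True
    return reject_i, reject_j, reject_ij
```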

6 Full Scale Identification

For full scale system identification, using the Galois sequence is not appropriate because the Galois sequence only excites the system at a finite number of points. We assume in this section that the input u[k] is an iid random sequence in an (unknown) open interval \(I \subset R\) with an (unknown) probability density function \(\psi (\cdot )\). Then, the results of [3] can be used. Similar to the structural estimation case, the system (1) needs to be normalized for identification purposes. Let \(\mathbf{E}\) be the expectation operator. Define the partial averages,

$$\begin{aligned} c_{ij}&=\mathbf{E}\{ \bar{f}_{ij}(u[k-i],u[k-j])\}, \\c_1&=\mathbf{E}\{\bar{f}_1(u[k-1])+ \sum _{j=2}^n \mathbf{E}(\bar{f}_{1j}(u[k-1],u[k-j])~|~ u[k-1]=x_1)\}, \\c^1_i&=\mathbf{E}\{\bar{f}_i(u[k-i])+ \sum _{j=i+1}^n \mathbf{E}(\bar{f}_{ij}(u[k-i],u[k-j])~|~u[k-i]=x_i)\}, \\c^2_i&=\sum _{j=1}^{i-1} \mathbf{E}(\bar{f}_{ji}(u[k-j],u[k-i])~|~u[k-i]=x_i), \end{aligned}$$
$$\begin{aligned} c_n&=\mathbf{E}\{\bar{f}_n(u[k-n])+ \sum _{j=1}^{n-1} \mathbf{E}(\bar{f}_{jn}(u[k-j],u[k-n])~|~u[k-n]=x_n) \}. \end{aligned}$$

Now, for every \(x_i\) and \(x_j \in I\), define

$$\begin{aligned} f_{ij}(x_i,x_j)&= \bar{f}_{ij}(x_i,x_j) -\mathbf{E}( \bar{f}_{ij}(u[k-i],u[k-j])~|~u[k-j]=x_j) \nonumber \\&-\mathbf{E}( \bar{f}_{ij}(u[k-i],u[k-j])~|~u[k-i]=x_i) +c_{ij},~~1\le i<j\le n, \nonumber \\ f_1(x_1)&= \bar{f}_1(x_1)+ \sum _{j=2}^n \mathbf{E}(\bar{f}_{1j}(u[k-1],u[k-j])~|~ u[k-1]=x_1) -c_1, \nonumber \\ f_i(x_i)&= \bar{f}_i(x_i)+ \sum _{j=i+1}^n \mathbf{E}(\bar{f}_{ij}(u[k-i],u[k-j])~|~u[k-i]=x_i) \nonumber \\&+\sum _{j=1}^{i-1} \mathbf{E}(\bar{f}_{ji}(u[k-j],u[k-i])~|~u[k-i]=x_i) -c^1_i-c^2_i,~~ i=2,3,\ldots ,n-1, \nonumber \\ f_n(x_n)&= \bar{f}_n(x_n)+ \sum _{i=1}^{n-1} \mathbf{E}(\bar{f}_{in}(u[k-i],u[k-n])~| ~u[k-n]=x_n)-c_n. \end{aligned}$$
(19)

Next, with \( c=\bar{c} -\sum _{1\le i< j \le n}c_{ij}+\sum _{i=1}^nc_i\), \(c_i=c^1_i+c^2_i\), the system (1) can be written as

$$\begin{aligned} y[k] =c+\sum _{i=1}^nf_i(u[k-i]) + \sum _{1\le i<j \le n} f_{ij}(u[k-i], u[k-j]) +v[k],~k=1,2,\ldots ,N \end{aligned}$$
(20)

with

$$\begin{aligned} \mathbf{E}f_i(u[k-i])&=\mathbf{E}(f_{ij}(u[k-i], u[k-j])~|~u[k-i]=x_i)\\&=\mathbf{E}(f_{ij}(u[k-i], u[k-j])~|~u[k-j]=x_j)= 0. \end{aligned}$$

The problem is how to identify \(f_i\) and \(f_{ij}\). Observe that these functions are conditional expectations and thus can easily be calculated from empirical data, for instance using the kernel estimation method [3]. To this end, we define the kernel functions. A continuous, bounded and radially symmetric function \(K(\cdot )\) is said to be a kernel function if

$$\begin{aligned} K(z) = \left\{ \begin{array}{ll} >0, &{} z \in [-1,1] \\ 0, &{} z \not \in [-1,1]\\ \end{array} \right. ~~\text{ and }~~\int _{-1}^{1} K(z)dz=1. \end{aligned}$$
(21)

Now, the estimates of c, \(f_i\) and \(f_{ij}\) can be defined for each \(x_i,x_j \in I\) in which the input \(u[\cdot ]\) lies,

$$\begin{aligned} \hat{c}= {1 \over N}\sum _{k=1}^N y[k] \end{aligned}$$
(22)
$$ \hat{f}_i(x_i)= {{\sum _{k=1}^N K( \frac{x_i-u[k-i]}{\delta })y[k]} \over {\sum _{k=1}^N K( \frac{x_i-u[k-i]}{\delta })}}-\hat{c},~i=1,\ldots ,n $$
$$ \hat{f}_{ij}(x_i,x_j)= {{\sum _{k=1}^N K( \frac{\Vert (x_i,x_j)- (u[k-i], u[k-j])\Vert }{\delta })y[k]} \over {\sum _{k=1}^N K( \frac{\Vert (x_i,x_j)- (u[k-i], u[k-j])\Vert }{\delta })}}-\hat{f}_i(x_i) -\hat{f}_j(x_j)-\hat{c}, ~1\le i <j\le n $$

where \(\delta >0\) is the bandwidth. The following result, which is a standard exercise, follows from [3].
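
As a sketch of (22), the estimates take the familiar Nadaraya–Watson form; here we use a triangle kernel (which, up to the boundary points, satisfies (21)), and the lagged-array alignment and function names are our assumptions.

```python
import numpy as np

def tri_kernel(z):
    # triangle kernel: positive inside (-1, 1), zero outside, integrates to one
    return np.maximum(1.0 - np.abs(z), 0.0)

def c_hat(y):
    """\\hat c in (22): sample mean of the output."""
    return np.mean(y)

def f_hat_1factor(x, lag_u, y, delta):
    """\\hat f_i(x_i) in (22); lag_u[k] holds u[k-i] aligned with y[k]."""
    w = tri_kernel((x - lag_u) / delta)
    return np.dot(w, y) / np.sum(w) - c_hat(y)

def f_hat_2factor(xi, xj, lag_ui, lag_uj, y, delta):
    """\\hat f_{ij}(x_i, x_j) in (22); the distance is the Euclidean norm."""
    w = tri_kernel(np.hypot(xi - lag_ui, xj - lag_uj) / delta)
    smoothed = np.dot(w, y) / np.sum(w)
    return (smoothed - f_hat_1factor(xi, lag_ui, y, delta)
            - f_hat_1factor(xj, lag_uj, y, delta) - c_hat(y))
```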

Theorem 6

Consider the system (3) with differentiable \(f_i\) and \(f_{ij}\), and any kernel function defined above. Then, for any \(x_i,x_j \in I\), provided that the input density function is positive at \(x_i,x_j\), i.e., \(\psi (x_i), \psi (x_j) >0\) and \(\delta \rightarrow 0\), \(\delta ^2N \rightarrow \infty \) as \(N \rightarrow \infty \), we have

$$ \hat{c} \rightarrow c $$
$$ \hat{f}_i(x_i) \rightarrow f_i(x_i) $$
$$ \hat{f}_{ij}(x_i,x_j) \rightarrow f_{ij}(x_i,x_j) $$

in probability as \(N \rightarrow \infty \).

7 Comparisons with Existing Methods

A new representation for a class of nonlinear nonparametric systems has been proposed in (16). Further, structural estimation and full scale identification have been discussed in the previous sections. Naturally, two questions arise. The first is: what are the advantages of the representation (16) compared to some existing methods, in particular the fixed basis function approach and the Volterra series? Second, even if one accepts the representation (16), why use the structural estimation and system identification techniques discussed in the previous sections as compared to the traditional approach of identifying \(f(u[k-1],\ldots ,u[k-n])\) directly? We address these two issues in this section.

7.1 Relation with the Volterra Series

If the system (16) is smooth with an upper bound n on the time lag, its Volterra series is given by

$$ y[k]=h_0+\sum _{l=1}^\infty \sum _{i_1=1}^n \sum _{i_2=i_1}^n \cdots \sum _{i_l=i_{l-1}}^n h_l(i_1,\ldots ,i_l) \cdot u[k-i_1]u[k-i_2]\ldots u[k-i_l]+v[k]. $$

Two of the major advantages of the Volterra series are that (1) it is in a closed form and (2) it is parametric. In other words, any smooth nonlinear nonparametric system can always be written in the above form. Further, identification becomes a linear estimation of the coefficients \(h_l\)’s. However, the Volterra series also has some disadvantages. In this work, we are mainly interested in verifying whether the Volterra series is a good candidate for systems of short-term memory and low degree of interaction as in (1) or (3). To this end, we need to understand the differences between a system of low degree of interaction and a system of low order in the classical sense. Traditionally, a system is said to be of low order if it can be written as, or at least can be well approximated by, a low-order multidimensional polynomial. For instance, a system is said to be of first order if it is linear,

$$ y[k]=f(u[k-1],\ldots ,u[k-n]) = c + \sum _{i=1}^n \alpha _i u[k-i] $$

or to be of second order if

$$ y[k]= c + \sum _{i=1}^n \alpha _i u[k-i] +\sum _{1 \le j_1 \le j_2 \le n} \gamma _{j_1j_2} u[k-j_1]u[k-j_2]. $$

Clearly, in both cases, the system consists of 1-factor or 2-factor terms. In general, a system of low order in the traditional sense implies a low degree of interaction. The other way around is however incorrect. For example, \(e^{u[k-1]}\) is a 1-factor term that is not necessarily of low order, depending on the input magnitude. Also, \((u[k-1]u[k-2])^{10}\) is a 2-factor term which may not be approximated well by a second-order polynomial. Therefore, nonlinear systems of low order in the traditional sense are systems of low degree of interaction, but the reverse implication is not necessarily true. Now, we consider a Volterra series approach. A second-order Volterra series is a model that contains all the first- and second-order kernels \(u[k-i]\)’s and \(u[k-j_1]u[k-j_2]\)’s. This model is a 2-factor interaction system. However, a 2-factor system \(y[k]=e^{u[k-1]} +(u[k-1]u[k-2])^{10}\) is definitely not represented well by a low-order Volterra series.

In summary, if a nonlinear system of short-term memory and low degree of interaction resembles the structure of a low-order multidimensional polynomial, the Volterra series is a good candidate. If the system is far away from a polynomial or the order of the polynomial is high, the Volterra series is not a good candidate simply because too many terms are needed to approximate the given system. In such a case, i.e., when the unknown system is of low degree of interaction but not necessarily a low-order polynomial, the proposed representation is a viable choice. This observation is not surprising because the Volterra series is an extension of the Taylor polynomial expansion of an analytic function. The advantages of the proposed representation for systems of short memory and low degree of interaction will be further illustrated in the simulation section.

7.2 Basis Function Approach

Without structural information, a fixed basis function approach is often used in nonlinear system identification. Typical basis functions are Fourier series, polynomials, and some orthogonal versions. Obviously, the success of a basis function approach relies on how much a priori information is available on the unknown structure. If the chosen basis functions resemble the structure of the unknown nonlinear system, only a few terms are needed to represent the unknown system. In this case, identification is likely to be successful. Otherwise, a fixed basis function approach requires a large number of terms which has a considerable negative effect on the identification step. The advantage of the proposed representation is that, if a nonlinear system has short-term memory and low degree of interaction which fits (3), then no additional structural information is required. In other words, there is no need to choose any basis functions and whether a chosen basis function resembles the unknown structure is no longer an issue.

7.3 Traditional One Shoot Kernel Approach

Once the representation of (1) or (3) is accepted, the second question is why the identification method proposed in the previous section should be used instead of identifying the nonlinear function \(f(u[k-1],\ldots ,u[k-n])\) directly, which is the traditional approach. The difference is that the identification method proposed in this work decomposes a potentially high-dimensional nonlinear identification problem into a number of one- or two-dimensional problems. Since the method proposed in this work is kernel based, we compare it with the one shoot kernel-based identification method.

First, for the one shoot kernel estimation of \(f(u[k-1],\ldots ,u[k-n])\) under iid inputs, the asymptotic convergence rate [12] is \(O(N^{-\frac{\alpha }{2\alpha +n}})\), where N is the total number of data points and \( \alpha \) depends on the choices of the kernel functions and the bandwidth. For the method proposed in this work, because each identification is one or two dimensional, the asymptotic convergence rate is \(O(N^{-\frac{\alpha }{2\alpha +n}}|_{n=2})= O(N^{-\frac{\alpha }{2\alpha +2}})\) [12]. Thus, asymptotically, there is a clear advantage in using the proposed method.
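As a rough numerical illustration of this rate gap, the sketch below assumes \(\alpha =2\) and a target error of 0.1, and ignores constants; it only translates the two rates into the data lengths they suggest.

```python
# Rough illustration of the asymptotic rates above (constants ignored).
# error ~ N^(-alpha/(2*alpha + d)), so a target error eps needs roughly
# N ~ eps^(-(2*alpha + d)/alpha) data points.
alpha, eps = 2.0, 0.1      # alpha = 2 is an assumption for illustration only
for d in (2, 8):           # d = 2: proposed method; d = 8: direct estimation of f for n = 8
    print(d, eps ** (-(2 * alpha + d) / alpha))
# about 1e3 points for d = 2 versus about 1e6 points for d = 8
```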

Next, we consider the case that N is large but fixed. For nonlinear system identification, the curse of dimensionality is always a concern even for a modest n. We use similar arguments and examples as in [2] to illustrate the situation. Let \(u[\cdot ]\) be uniformly distributed in \(I=[-1, 1]\). Suppose one wants to estimate \(f(x_1,x_2,\ldots ,x_n)\) at a point \((x_1,x_2,\ldots ,x_n) \in I^n\). Since any nonparametric identification scheme, including the kernel approach, is some form of local smoother or weighted average based on the measurement data in the neighborhood of \((x_1,x_2,\ldots ,x_n)\), there must be enough data in the neighborhood to average out the effects of noise and the uncertainty due to lack of structural information. For simplicity, suppose the neighborhood is a hyper-box with side length 0.1. Then, the volume of \(I^n\) is \(2^n\) and the volume of the neighborhood is \(0.1^n\). This implies that the probability that a measurement \((u[k-1],u[k-2],\ldots ,u[k-n])\) falls in the neighborhood of \((x_1,x_2,\ldots ,x_n)\) is \((1/20)^n\), which goes to zero exponentially as n grows. For a large N, there are on average \(N\cdot (1/20)^n\) measurements in the neighborhood. Unless N is huge, there is not enough data in a neighborhood for identification purposes even for a modest n. For the proposed method, however, the maximum dimension is two, so the curse of dimensionality is not a problem. For instance, let \(n=8\). Then, the problem becomes identification of eight 1-factor terms \(f_j(u[k-j])\), \(j=1,2,\ldots ,8\), and 28 2-factor terms \(f_{j_1j_2}(u[k-j_1],u[k-j_2])\). Though the number of identification steps increases, the complexity of each identification is reduced drastically. Because of decoupling, the probability of \(u[k-j]\) falling in the neighborhood of \(x_j\) for one-dimensional identification is 0.05 and the probability of \((u[k-j_1],u[k-j_2])\) falling in the neighborhood of \((x_{j_1},x_{j_2})\) is 0.0025. Suppose that the total number of data points is \(N=10^5\). This implies that there are on average 5000 or 250 measurements in the neighborhood for identification of 1-factor or 2-factor terms, respectively. Recall that if the eight-dimensional \(f(x_1,\ldots ,x_8)\) is identified directly, the probability that a data vector falls in the neighborhood of \((x_1,\ldots ,x_8)\) is \((1/20)^8\). With \(N=10^5\), the expected number of measurements in a neighborhood is \((1/2)^8\cdot 10^{-3}=\frac{1}{2^8 \cdot 10^3}\approx 3.9\times 10^{-6}\), which makes identification nearly impossible. Clearly, the performance of identification of the 1-factor or 2-factor terms can be substantially improved for the same N, compared to the identification of an eight-dimensional problem f. This effectively combats the curse of dimensionality.
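The counting argument above can be reproduced in a few lines; the sketch below uses the same numbers as in the text (input uniform on \([-1,1]\), hyper-box side 0.1, \(N=10^5\)), with a function name of our own.

```python
def expected_neighbors(N, dim, side=0.1, interval_length=2.0):
    """Expected number of samples falling in a hyper-box neighborhood when each
    input coordinate is iid uniform on an interval of length interval_length."""
    p = (side / interval_length) ** dim   # probability that one sample lands in the box
    return N * p

N = 1e5
print(expected_neighbors(N, dim=1))   # ~5000 samples for a 1-factor term
print(expected_neighbors(N, dim=2))   # ~250 samples for a 2-factor term
print(expected_neighbors(N, dim=8))   # ~3.9e-6 samples for the full 8-dimensional f
```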

8 Numerical Simulation

We now provide numerical simulation examples, separating the discussion into random inputs and Galois sequence inputs.

Fig. 1 \(\phi _j[k]=f_j(u[k-j])\)’s (solid) and their estimates \(\widehat{\phi }_j[k]\) (dashdot), \(j=1,2,3,4,5\)

Fig. 2 \(\phi _6[k]\), \(\widehat{\phi }_6[k]\) and \(\phi _{10}[k]\), \(\widehat{\phi }_{10}[k]\)

Fig. 3 \(\widehat{\phi }_j[k]\), \(j=7,8,9,11,12,13,14\) and 15

8.1 Random Inputs

Example 1

Consider a nonlinear system

$$\begin{aligned} y[k]&=f(u[k-1],u[k-2],u[k-3],u[k-4],u[k-5])+v[k] \\ &=\underbrace{1.25/3}_{\phi _0=c}+ \underbrace{u[k-1]}_{\phi _1=f_1}+ \underbrace{10\cdot u[k-2]^3}_{\phi _2=f_2}+ \underbrace{5\cdot u[k-3]^2-1.25/3}_{\phi _3=f_3} \\ &\quad +\underbrace{0}_{\phi _4=f_4}+ \underbrace{0}_{\phi _5=f_5}+ \underbrace{5 \cdot u[k-1]\cdot u[k-2]}_{\phi _6=f_{12}}+ \underbrace{0}_{\phi _7=f_{13}} \\ &\quad +\underbrace{0}_{\phi _8=f_{14}}+ \underbrace{0}_{\phi _9=f_{15}} +\underbrace{0.5\cdot \sin (2 \pi (u[k-2]+u[k-3]))}_{\phi _{10}=f_{23}} \\ &\quad +\underbrace{0}_{\phi _{11}=f_{24}}+ \underbrace{0}_{\phi _{12}=f_{25}}+ \underbrace{0}_{\phi _{13}=f_{34}}+ \underbrace{0}_{\phi _{14}=f_{35}}+ \underbrace{0}_{\phi _{15}=f_{45}}+ v[k] \end{aligned}$$
(23)

No prior structural information on f is available. The time lag of the system is unknown and only an upper bound of \(n=5\) is assumed. For simulation, \(N=20,000\) and \(\delta =0.1\). The input \(u[\cdot ]\) is independent and uniformly distributed in \([-0.5,0.5]\), and the noise \(v[\cdot ]\) is iid Gaussian with \(SNR=20\) dB.
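A minimal data-generation sketch for this example is given below; it assumes the noise standard deviation is set from the noise-free output power so that \(\mathrm{SNR}=20\) dB, and all function and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 20000, 5

def f(u1, u2, u3, u4, u5):
    # Noise-free part of system (23); u[k-4] and u[k-5] do not affect the output.
    return (1.25/3 + u1 + 10*u2**3 + (5*u3**2 - 1.25/3)
            + 5*u1*u2 + 0.5*np.sin(2*np.pi*(u2 + u3)))

u = rng.uniform(-0.5, 0.5, N + n)                      # iid input on [-0.5, 0.5]
y0 = np.array([f(u[k-1], u[k-2], u[k-3], u[k-4], u[k-5])
               for k in range(n, N + n)])              # noise-free output
sigma = np.sqrt(np.mean(y0**2) / 10**(20/10))          # assumed SNR definition: 20 dB
y = y0 + sigma * rng.normal(size=N)                    # measured output
```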

Fig. 4 Average error versus the estimation order

Table 1 Relative contributions for \(N=\) 20000, 10000, and 5000, with \(d_1=0.03\)

Figure 1 shows the actual but unknown \(\phi _j[k]\) (solid) and their estimates \(\widehat{\phi }_j[k]\) (dashdot), \(j=1,\ldots ,5\). The top diagrams of Fig. 2 show \(\phi _6[k]\) and \(\phi _{10}[k]\) superimposed with their estimates \(\widehat{\phi }_6[k]\) and \(\widehat{\phi }_{10}[k]\). The estimation errors \(\phi _6[k]-\widehat{\phi }_6[k]\) and \(\phi _{10}[k]-\widehat{\phi }_{10}[k]\) are shown in the bottom diagrams. The estimates \(\widehat{\phi }_j [k]\)’s, \(j=7,8,9,11,12,13,14\) and 15, are shown in Fig. 3. It can be seen that all the estimates fit the actual but unknown functions well.

Fig. 5 Cumulative and relative contributions

Fig. 6 Actual output (solid) and predicted output (dash-dot) for a fresh input

To determine the order of the estimation model, we calculate the residual and plot the average error as a function of the estimation order p, as in the top diagram of Fig. 4. There is a drastic reduction in the average error at the order \(p=10\) and little change for \(p >10\). Thus, we take \(p=10\) and test whether the order \(p=10\) is acceptable by the modified Box–Pierce test (9). When \(p=10\), \(Q_{n-1}=Q_4=5.6434\). Let the level of significance be 0.05. This corresponds to, from the \(\chi ^2(n-1) =\chi ^2(4)\) distribution table, the threshold \(d=9.4877\). Since \(Q_4=5.6434 < d=9.4877\), the order \(p=10\) is accepted, which is in fact the actual but unknown order. The order determination can also be carried out by the relative contribution \(R_c[p]\) shown in Table 1 as well as in the bottom diagram of Fig. 5. The cumulative contribution \(C_c[p]\) is shown in the top diagram of Fig. 5. To determine which terms \(\widehat{\phi }_j\) should be included in the estimate, let the threshold be \(d_1=0.03\). If \(\widehat{R}_c[j] \ge d_1\), we include the corresponding term \(\widehat{\phi }_j\) in the model. Otherwise, the contribution of the corresponding term is deemed insignificant and the term is omitted from the model. Clearly, from Table 1, only the terms \(\widehat{\phi }_0\), \(\widehat{\phi }_1\), \(\widehat{\phi }_2\), \(\widehat{\phi }_3\), \(\widehat{\phi }_6\) and \(\widehat{\phi }_{10}\) contribute significantly and should be included in the model. Simply put, the system time lag is determined to be \(n=3\), though the upper bound was assumed to be 5. Further, it is determined that the system contains only six terms, \(\phi _0=c\), \(\phi _1=f_1\), \(\phi _2=f_2\), \(\phi _3=f_3\), \(\phi _6=f_{12}\), and \(\phi _{10}=f_{23}\), and all other terms are zero. This conclusion is consistent with the true but unknown system.
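The two decision rules used above, the \(\chi ^2\) threshold for the modified Box–Pierce statistic and the relative-contribution threshold \(d_1\), amount to the following sketch; the dictionary of relative contributions is a hypothetical placeholder for the values in Table 1.

```python
from scipy.stats import chi2

# Threshold for the modified Box-Pierce test: significance 0.05 with n - 1 = 4 dof.
d = chi2.ppf(0.95, df=4)                 # = 9.4877
Q4 = 5.6434                              # statistic reported in the text
accept_order = Q4 < d                    # True, so the order p = 10 is accepted

# Regressor selection: keep phi_j whose relative contribution R_c[j] >= d_1.
d1 = 0.03
R_c = {j: 0.0 for j in range(16)}        # hypothetical placeholder for the Table 1 values
selected = [j for j, r in R_c.items() if r >= d1]
```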

Finally, to validate the obtained estimate \(\widehat{f}=\sum _{i=0,1,2,3,6,10} \widehat{\phi }_i [k]\), a fresh input

$$ u[k]=0.5 \sin (k/10)\cdot \cos (k/20),~k=1,\ldots ,150 $$

is generated, which is completely different from the white noise input used for identification. A standard goodness-of-fit criterion

$$\begin{aligned} \left(1-\sqrt{ \frac{\sum _k(y[k]-\widehat{y}[k])^2}{\sum _k \bigl(y[k]-\frac{1}{N} \sum _k y[k]\bigr)^2}}\,\right) \times 100\% \end{aligned}$$
(24)

is calculated. Based on the fresh input, the output y[k] of the actual but unknown nonlinear system (23) is generated as well as the predicted output \(\widehat{y}[k]\) based on the estimate

$$\begin{aligned} \widehat{y}[k]&=\widehat{f}(u[k-1],u[k-2],u[k-3],u[k-4],u[k-5]) \\ &=\widehat{\phi }_0 +\widehat{\phi }_1[k]+ \widehat{\phi }_2[k]+ \widehat{\phi }_3[k] +\widehat{\phi }_6[k]+ \widehat{\phi }_{10}[k]. \end{aligned}$$

Figure 6 shows the actual output y[k] (solid) and the predicted output \(\widehat{y}[k]\) (dash-dot) with a goodness-of-fit of 0.9411, an almost perfect fit. This validates the effectiveness of the identification method proposed in this work, along with its order determination and regressor selection.
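For reference, the fit criterion (24) can be computed directly as in the sketch below (reported as a fraction rather than a percentage, matching the value 0.9411 quoted above).

```python
import numpy as np

def goodness_of_fit(y, y_hat):
    """Criterion (24) as a fraction: 1 - ||y - y_hat|| / ||y - mean(y)||."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.sqrt(np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2))
```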

8.2 Galois Sequence Inputs

In this subsection, we discuss two numerical examples that shed light on the efficiency of the proposed representation and identification method using Galois sequence inputs, in the context of existing methods.

Example 2

$$\begin{aligned} w[k]&=u[k]-0.3u[k]^3 \\ x[k]&=0.3x[k-1]-0.02x[k-2]+0.5w[k-1]+0.4w[k-2] \\ y[k]&=x[k]+0.4x[k]^2 +v[k] \end{aligned}$$

The noise v[k] is 0.2 times an iid zero-mean, unit-variance Gaussian random variable. The actual nonlinear system is IIR and therefore there are no exact \(f_i\) and \(f_{ij}\). We represent the system by (3) assuming that the maximum time lag is \(n \le 8\). Note that determination of the order of an unknown nonlinear system is an interesting and open problem that is out of the scope of this work. Here we just assume that the upper bound \(n=8\) is available (admittedly it could be restrictive in some applications).
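A minimal simulation sketch for this example is given below; the function name and the data lengths are ours, and the identification input matches the iid uniform input on \([-1.5,1.5]\) with \(N=5000\) used later in this example.

```python
import numpy as np

def simulate_example2(u, rng):
    """Simulate the system of Example 2 for an input sequence u;
    the noise is 0.2 times a standard Gaussian, as stated above."""
    N = len(u)
    w = u - 0.3 * u**3                                  # static input nonlinearity
    x = np.zeros(N)
    for k in range(2, N):                               # linear IIR dynamics
        x[k] = 0.3*x[k-1] - 0.02*x[k-2] + 0.5*w[k-1] + 0.4*w[k-2]
    return x + 0.4 * x**2 + 0.2 * rng.normal(size=N)    # output nonlinearity plus noise

rng = np.random.default_rng(0)
u_id = rng.uniform(-1.5, 1.5, 5000)     # identification input used later in the example
y_id = simulate_example2(u_id, rng)
```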

First, structural estimation is carried out by using a binary Galois sequence \(GF(2^8)\) with \(n=8\), \(l=2\), \(L=11\), \(a_1=1\), and \(a_2=0\). ANOVA was used to calculate \(T^{i}\) and \(T^{ij}\), shown in Table 2, which are the averages of 50 Monte Carlo simulations.

Table 2 Calculated \(T^i\) and \(T^{ij}\) for polynomial input nonlinearity

For the hypothesis tests, we choose \(\alpha =0.1\). From the F distribution, we have \(F_{0.1}(1,40)=2.84\). By the F-tests, we have \(T^1, T^2,T^3,T^4,T^{12},T^{13},T^{23} >2.84\), and all other \(T^i,T^{ij} < 2.84\), as can be seen in Table 2. Thus, we reject the hypotheses that \(f_1\), \(f_2\), \(f_3\), \(f_4\), \(f_{12}\), \(f_{13}\), and \(f_{23}\) are negligible and assume that all other terms are zero. Second, these non-negligible terms are identified with an iid input uniformly distributed in \([-1.5,1.5]\), a triangle kernel [3] with \(\delta =0.4\), and a total number of data points \(N=5000\). Further, their estimates are used to construct the model

$$\begin{aligned} \hat{y}[k]&=\hat{c}+ \hat{f}_1(u[k-1])+ \hat{f}_2(u[k-2])+ \hat{f}_3(u[k-3])+\hat{f}_4(u[k-4]) \\&+\hat{f}_{12}(u[k-1],u[k-2])+\hat{f}_{13}(u[k-1],u[k-3])+\hat{f}_{23}(u[k-2],u[k-3]). \end{aligned}$$
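The F-test selection above reduces to a simple thresholding step, sketched below; the statistics dictionary is a hypothetical placeholder for the \(T^i\) and \(T^{ij}\) values of Table 2.

```python
from scipy.stats import f as f_dist

# F threshold used above: alpha = 0.1 with (1, 40) degrees of freedom.
F_threshold = f_dist.ppf(1 - 0.1, dfn=1, dfd=40)      # = 2.84

# Hypothetical placeholder for the ANOVA statistics of Table 2.
T_stats = {"T1": 0.0, "T2": 0.0, "T12": 0.0}          # replace with computed values
kept = [name for name, T in T_stats.items() if T > F_threshold]
```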

To validate the model, the input is generated

$$ u[k]=1.5 \sin (k/10) \cos (k/20),~k=1,\ldots ,160 $$

as well as the corresponding actual outputs y[k] and predicted outputs \(\hat{y}[k]\)’s.

Figures 7, 8, 9, and 10 show y[k] and the \(\hat{y}[k]\)’s predicted by the proposed method, the fourth-order Volterra series, a fixed polynomial basis up to second order, and the one shoot method, respectively, as well as their goodness-of-fits (gof’s). Since the actual nonlinearity is a polynomial, the proposed method, the Volterra series, and the fixed polynomial basis all perform satisfactorily, significantly better than the one shoot method, as expected. An overview of the performances is given in Table 3.

Example 3

$$\begin{aligned} w[k]&=u[k]-0.3u[k]^3 e^{1.4u[k]}\\ x[k]&=0.3x[k-1]-0.02x[k-2]+0.5w[k-1]+0.4w[k-2] \\ y[k]&=x[k]+0.4x[k]^2 +v[k]. \end{aligned}$$

The only difference between Examples 2 and 3 is that the input nonlinearity now contains an exponential term. All other simulation conditions remain the same. \(T^i\) and \(T^{ij}\) for Example 3 are given in Table 4 for a binary test input \(GF(l^n)\) with \(n=8\), \(l=2\), and \(L=11\).

With \(\alpha =0.1\) and by the F-test as shown in Table 4, only the terms \(f_1\), \(f_2\), \(f_3\), \(f_4\), \(f_5\), \(f_{12}\), \(f_{13}\), \(f_{14}\), \(f_{23}\), and \(f_{24}\) are not negligible and thus the model is given by

$$\begin{aligned} \hat{y}[k]&=\hat{c}+ \hat{f}_1(u[k-1])+ \hat{f}_2(u[k-2])+ \hat{f}_3(u[k-3])+ \hat{f}_4(u[k-4])+\hat{f}_5(u[k-5])\\&+\hat{f}_{12}(u[k-1],u[k-2])+\hat{f}_{13}(u[k-1],u[k-3])+\hat{f}_{14}(u[k-1],u[k-4])\\&+\hat{f}_{23}(u[k-2],u[k-3])+\hat{f}_{24}(u[k-2],u[k-4]). \end{aligned}$$

Under the same validation input, the corresponding y[k] and the \(\hat{y}[k]\)’s predicted by various methods are shown in Figs. 11, 12, 13 and 14. The corresponding gof’s are given in Table 5.

Fig. 7 Actual y[k] and predicted \(\hat{y}[k]\) by the proposed method with gof \(=\) 0.9470 (polynomial nonlinearity)

Fig. 8 Actual y[k] and predicted \(\hat{y}[k]\) by a fourth-order Volterra with gof \(=\) 0.9563 (polynomial nonlinearity)

Fig. 9 Actual y[k] and predicted \(\hat{y}[k]\) by a second-order polynomial with gof \(=\) 0.8121 (polynomial nonlinearity)

Fig. 10 Actual y[k] and predicted \(\hat{y}[k]\) by the one shoot method with gof \(=\) 0.6762 (polynomial nonlinearity)

Table 3 Goodness-of-fits for the polynomial input nonlinearity
Table 4 \(T^i\) and \(T^{ij}\) for exponential nonlinearity
Fig. 11 Actual y[k] and predicted \(\hat{y}[k]\) by the proposed method (exponential nonlinearity)

Fig. 12 Actual y[k] and predicted \(\hat{y}[k]\) by a third-order Volterra (exponential nonlinearity)

Fig. 13 Actual y[k] and predicted \(\hat{y}[k]\) by a third-order polynomial (exponential nonlinearity)

Fig. 14 Actual y[k] and predicted \(\hat{y}[k]\) by the one shoot method (exponential nonlinearity)

Table 5 Goodness-of-fits for the exponential input nonlinearity

The results of the second-, third-, fourth-, fifth-, and sixth-order Volterra series are also shown in Table 5 and Fig. 12, exhibiting a considerable performance deterioration. This is because a low-order polynomial approximation in \(u[\cdot ]\), like the Volterra series, is inefficient at modeling an exponential function. This demonstrates the advantage of the proposed representation, along with its structural estimation and system identification, for nonlinear nonparametric systems of short-term memory and low degree of interaction. It is interesting to note that a higher-order Volterra series does not necessarily imply a better identification result because the variance error also increases as the order gets higher. The gof’s of the fixed basis function approach for the second- and third-order polynomials are 0.2299 and 0.1659, respectively. Figure 13 shows the corresponding y[k] and \(\hat{y}[k]\) for the third-order fixed basis function approach. Again, the performance of a fixed basis function approach depends on whether the chosen functions resemble the unknown structure. The result of the one shoot kernel is shown in Fig. 14 with gof \(=\) 0.1679, a poor performance. The reason is that for the higher dimension \(n=8\), the bandwidth \(\delta \) has to be large or there is no data in the neighborhood, which consequently increases the bias. In the simulation, the bandwidth was carefully adjusted to find the best gof, which is reported here. It is clear that, for Example 3, which is of short-term memory and low degree of interaction, the proposed method outperforms the other methods.

9 Discussion

In this section, we provide further discussion to shed some light on the proposed method.

  • Orthogonalization and marginal influences: The essential step of this work is an orthogonalization procedure that allows us to write the output as a summation of marginal influences of the input variables. Then, these marginal influences are estimated by empirical averages weighted by a kernel function. This is related to the additive or generalized additive models investigated in the statistics literature [12], especially as discussed in a recent publication [26].

  • FIR and iid assumptions: The orthogonalization is achieved in this work by assuming iid inputs and an FIR structure of the unknown nonlinear system. The iid assumption removes statistical correlations between input variables and makes orthogonalization easier. The iid condition is, however, not critical as long as the correlations between the \(u[k-i]\)’s and \(u[k-j]\)’s are available so they can be canceled out in the orthogonalization procedure. On the other hand, the FIR assumption on the nonlinear system is critical. Without this assumption, the output y[k] is a function of the previous outputs \(y[k-i]\)’s as well as the inputs \(u[k-j]\)’s, which are correlated. The exact correlation between \(y[k-i]\) and \(u[k-j]\) depends on the system to be identified. This makes cancelation of the correlations between the output variables and between the output and input variables very difficult. We are working along this direction and some preliminary results have been reported in [4].

  • Kernel estimator and the choice of the bandwidth: The kernel estimator (6) is a smooth version of a conditional mean. The unknown function is estimated by the empirical mean of the measurements in the neighborhood of the point to be estimated. The size of the neighborhood, referred to as the bandwidth \(\delta \), controls the number of measurements to be used. The idea is to represent the unknown nonlinearities locally. All measurements outside the neighborhood, i.e., with \(\varphi (k)> \delta \), are not used to construct the estimates. The choice of \(\delta \) balances the trade-off between the bias and the variance. A large \(\delta \) implies a large bandwidth interval and accordingly more data is used, which results in a small variance. On the other hand, because more data points are used, including those not in a close vicinity, the approximation error gets large, which gives rise to a large bias term. A small \(\delta \) produces just the opposite, a large variance and a small bias. Hence, increasing \(\delta \) tends to reduce the variance but at the same time increases the bias. The best choice balances the bias and the variance. There is a huge literature on this topic and some guidelines for the choice of the bandwidth \(\delta \) are available in [12, 22, 26]. For instance, the optimal bandwidth can be derived by minimizing the mean square error if the analytical expression exists. Alternatively, a data-driven bandwidth can be derived by using the leave-one-out criterion. For details, see [12] and the references therein.
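    As a one-dimensional illustration of such a kernel-weighted local average, the sketch below uses a triangle kernel with bandwidth \(\delta \); it illustrates the idea behind (6) rather than reproducing the exact estimator, and the names are ours.

```python
import numpy as np

def local_average(x, u_lag, y, delta):
    """Triangle-kernel local average of y over samples whose regressor value
    u_lag is within delta of x; samples outside the bandwidth get zero weight."""
    phi = np.abs(np.asarray(u_lag) - x)      # distance of each sample to x
    w = np.maximum(delta - phi, 0.0)         # triangle weights, zero outside the neighborhood
    if w.sum() == 0:
        return np.nan                        # no data in the neighborhood
    return np.dot(w, np.asarray(y)) / w.sum()
```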

  • Recursive algorithms: The kernel estimator proposed in this work can be calculated recursively as new data become available. First, let \(\widehat{\phi }^{N+1}_0\) and \(\widehat{\phi }^N_0\) be the estimates of \(\phi _0\) at \(N+1\) and N, respectively, where the superscripts \(N+1\) and N emphasize the dependence on the data up to \(N+1\) and N, respectively. It is easily verified that

    $$ \widehat{\phi }^{N+1}_0= \frac{N}{N+1}\widehat{\phi }^N_0+\frac{1}{N+1} \cdot y[N+1]. $$

    To calculate \(\widehat{\phi }^{N+1}_j(x_j)\) from \(\widehat{\phi }^N_j(x_j)\), \(j=1,2,...,n\), recursively, consider

    1.

      Collect new data \(y[N+1], u[N]\) and calculate \(\varphi _j(x_j,N+1)=|u[N+1-j] -x_j|\).

    2.

      If \(\delta \le \varphi _j(x_j,N+1)\), then

      $$ w_j^{N+1}(x_j,k)=\left\{ \begin{array}{ll} w_j^N(x_j,k), & k=1,2,\ldots ,N \\ 0, & k=N+1 \end{array} \right. $$
    3.

      \(\widehat{\phi }^{N+1}_j(x_j)=\widehat{\phi }^{N}_j(x_j)\). Reset \(N+1\Rightarrow N\) and go back to step 1.

    4.

      If \(\delta > \varphi _j(x_j,N+1)\), let

      $$ \lambda (N+1)=\frac{l_j\delta -\sum _{i=1}^{l_j}{\varphi }_j(x_j,m_j(i))}{l_j\delta - \sum _{i=1}^{l_j} \varphi _j(x_j,m_j(i))+\delta - \varphi _j(x_j,N+1)} $$

      and define

      $$ w_j^{N+1}(x_j,k)= \left\{ \begin{array}{ll} w_j^N(x_j,k)\cdot \lambda (N+1), & k\in M_j=\{m_j(1),\ldots ,m_j(l_j)\}\\ \frac{\delta -\varphi _j(x_j,N+1)}{(l_j+1)\delta -\sum _{i=1}^{l_j}\varphi _j(x_j,m_j(i))- \varphi _j (x_j,N+1)}, & k=N+1 \\ 0, & k\notin \{N+1, m_j(1),\ldots ,m_j(l_j)\} \end{array} \right. $$

      Set \(m_j(l_j+1)=N+1\).

    5.

      \(\widehat{\phi }^{N+1}_j(x_j)= \widehat{\phi }^N_j(x_j) \cdot \lambda (N+1)+w_j^{N+1}(x_j,N+1)\,y[N+1].\) Reset \(l_j+1\Rightarrow l_j, N+1\Rightarrow N\) and go back to step 1.

    Other \(\widehat{\phi }_j\), \(j > n\), can be similarly calculated recursively.
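    A minimal sketch of the recursive update above, for a single point \(x_j\) and the triangle-type weights of steps 1-5, is given below; the class name and the internal bookkeeping are ours.

```python
class RecursivePhiJ:
    """Recursive update of the estimate of phi_j at a fixed x_j, following steps 1-5 (a sketch)."""

    def __init__(self, x_j, delta):
        self.x_j, self.delta = x_j, delta
        self.phis = []        # phi_j(x_j, m) = |u[m-j] - x_j| for samples inside the bandwidth
        self.estimate = 0.0   # current estimate of phi_j(x_j)

    def update(self, u_lag_j, y_new):
        """u_lag_j = u[N+1-j], y_new = y[N+1]."""
        phi_new = abs(u_lag_j - self.x_j)
        if phi_new >= self.delta:                 # steps 2-3: outside the bandwidth,
            return self.estimate                  # weights and estimate are unchanged
        S = len(self.phis) * self.delta - sum(self.phis)
        lam = S / (S + self.delta - phi_new)                  # step 4: shrink the old weights
        w_new = (self.delta - phi_new) / (S + self.delta - phi_new)
        self.estimate = self.estimate * lam + w_new * y_new   # step 5: update the estimate
        self.phis.append(phi_new)
        return self.estimate
```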

  • Higher-factor interactive term systems and computational complexity: This work focuses on systems with up to 2-factor interactive terms. All the results can be extended to higher-factor interactive term systems. We summarize the procedure for a 3-factor term system.

    Step 1: Consider the system (1). Define \(f_{j_1j_2j_3}\) which is the normalized \(\bar{f}_{j_1j_2j_3}\) so that the average is zero with respect to any \(x_i\) and \((x_i,x_j)\).

    Step 2: Redefine \(\bar{f}_{j_1j_2}\) by adding the original \(\bar{f}_{j_1j_2}\) to all the 2-factor terms with index \(j_1j_2\) resulting from the normalization of \(\bar{f}_{j_1j_2j_3}\). Normalize \(\bar{f}_{j_1j_2}\) to have \(f_{j_1j_2}\).

    Step 3: Redefine \(\bar{f}_j\) by adding the original \(\bar{f}_j\) to all the 1-factor terms with the index j resulting from the previous steps. Normalize \(\bar{f}_j\) to have \(f_j\). Also, adjust the constant term c.

    Then, the orthogonal functions \(\phi _j\)’s and their estimates \(\widehat{\phi }_j\)’s can be similarly defined. The estimates enjoy the same convergence properties as in the 2-factor term case.

    In theory, the procedure can be extended to any factor term system. However, the number of terms increases exponentially and so does the computational complexity. Practically, the method proposed in this work is most efficient for a low-order factor term system, say a 2-factor or 3-factor interactive term system with a modest time lag n.

  • Curse of dimensionality: A common feature of most nonlinear identification methods in the literature is to find directly the nonlinearity f representing the input–output relationship of the system. This amounts to solving a high-dimensional nonlinear identification problem directly and is usually difficult if n is not small. One of the main problems is the curse of dimensionality in nonparametric identification. To illustrate the situation, let \(u[\cdot ]\) be uniformly distributed in \(I=[-0.5,0.5]\). Suppose one wants to estimate \(f(x_1,x_2,...,x_n)\) at a point \((x_1,x_2,...,x_n) \in I^n\). Since any nonparametric identification scheme is some form of local smoother or weighted average based on the measurement data in the neighborhood of \((x_1,x_2,...,x_n)\), there must be enough data in the neighborhood to average out the effects of noise and the uncertainty due to lack of structural information. For simplicity, suppose the neighborhood is a hyper-box with side length 0.1. Then, the volume of \(I^n\) is \(1^n=1\) and the volume of the neighborhood is \(0.1^n\). This implies that the probability that a measurement \((u[k-1],u[k-2],...,u[k-n])\) falls in the neighborhood of \((x_1,x_2,...,x_n)\) is \(0.1^n/1=0.1^n\), which goes to zero exponentially as the order or dimension n gets larger. Let N be the total number of data measurements. For a large N, there are on average \(N\cdot 0.1^n\) measurements in the neighborhood. Unless N is huge, there is not enough data in a neighborhood for identification purposes even for a modest n.

    Now, consider the proposed method for a low-order factor term system, say a 2-factor term system. The aim of the method is not to estimate the high-dimensional f directly but to estimate the unknown interactive terms \(f_j\) and \(f_{j_1j_2}\), or the orthonormal functions \(\phi _j\)’s. Moreover, the identification of each interactive term is decoupled from the others. This is very beneficial. For instance, let \(n=5\). Then, the problem becomes identification of five 1-dimensional 1-factor terms \(f_j(u[k-j])\), \(j=1,2,...,5\), and ten 2-dimensional 2-factor terms \(f_{j_1j_2}(u[k-j_1],u[k-j_2])\), \(1\le j_1 <j_2 \le 5\). Though the number of identification problems increases, the complexity of each identification is reduced drastically. Because of decoupling, the probability of \(u[k-j]\) falling in the neighborhood of \(x_j\) for one-dimensional identification is \(0.1/1=0.1\) and the probability of \((u[k-j_1],u[k-j_2])\) falling in the neighborhood of \((x_{j_1},x_{j_2})\) is \(0.1^2/1=0.1^2\). Suppose the total number of data points is \(N=10^4\). This implies that there are on average \(10^3\) or \(10^2\) measurements in the neighborhood for identification of 1-factor or 2-factor terms, respectively. Recall that if the five-dimensional \(f(x_1,x_2,x_3,x_4,x_5)\) is identified directly, the probability that a data vector falls in the neighborhood of \((x_1,x_2,x_3,x_4,x_5)\) is \(0.1^5\). With \(N=10^4\), the expected number of measurements in a neighborhood is only 0.1, which makes identification nearly impossible in the presence of noise. Clearly, the performance of identification of the 1-factor or 2-factor terms can be substantially improved for the same N, compared to the identification of a five-dimensional problem f. This effectively combats the curse of dimensionality. In a sense, the approach proposed here replaces a difficult high-dimensional problem by a number of less difficult and manageable low-dimensional problems.

  • Combined residual analysis and statistical test: A version of the Box–Pierce test is developed in the context of nonlinear system identification. The reason behind this choice is that traditional Box–Pierce tests do not work well if there is a nonlinear dependence and could give misleading conclusions [32]. The modified Box–Pierce test overcomes this problem. Moreover, any Box–Pierce test assumes that the null hypothesis is true and then tests, based on a measured data set, whether the null hypothesis should be accepted with a given probability. It alone can never answer the question of the type II error as discussed in this work. The contribution of this work is to deal with this problem by combining the Box–Pierce test with residual analysis. This reasonably guarantees that the null hypothesis is true before the Box–Pierce test is applied.

    In the Box–Pierce test and the residual analysis, the choices of the level of significance and other parameters are always tricky and subjective. Whether a level of significance of 0.01 or 0.03 is enough is tightly connected to the intended purpose of the model. If prediction is the intended purpose, the identified model should be validated on fresh data to verify whether it fulfills the intended purpose. It may take several iterations to arrive at good design parameters for a particular application.

  • Finite data performance: The proposed method is convergent. The convergence rate is \(O(\frac{1}{\sqrt{\delta ^2 N}})\) for a system with up to 2-factor interactive terms and \(O(\frac{1}{\sqrt{\delta ^l N}})\) for a system with up to l-factor interactive terms. Like most nonlinear identification algorithms, the finite data performance of the proposed method is very hard to analyze analytically. We provide numerical simulations to demonstrate the finite data performance in terms of robustness with respect to the data length N, the bandwidth \(\delta \), and the order determination. To see the effect of the data length N on the order determination, the same example (23) was simulated under the same simulation conditions for \(N=\) 20000, 10000, and 5000, respectively. The results are given in Table 1 and are fairly consistent even when N varies widely from 5000 to 20000. To test the effects of the data length N and the bandwidth \(\delta \) on the obtained model, we use the goodness-of-fit (24) as an indicator. Table 6 shows the goodness-of-fit for various N and \(\delta \). Again, the identified model, in terms of prediction error, is robust with respect to variations of the design parameters N and \(\delta \).

Table 6 Goodness-of-fit as a function of N and \(\delta \)

10 Concluding Remarks

In this work, a data-driven orthogonal basis function approach is proposed for nonlinear system identification. The main advantage is that it eliminates the guesswork when there is little a priori information on the structure of the unknown system. Further, the data-driven basis functions are orthogonal and thus enjoy many preferable properties. We are working on extending the results presented in this work to IIR nonlinear systems.

In addition, methods are proposed for order determination and regressor selection. These topics are generally very hard for nonlinear system identification. The proposed methods have the potential to be applicable to many nonlinear system identification schemes, and we feel they deserve further study.

Finally, two structure identification methods under deterministic inputs are proposed to estimate the structure of the system before a full-scale system identification is performed. They can efficiently simplify the procedure of system identification.