1 Introduction

Parameter estimation plays an important role in controller design [13, 14], because controller design for dynamic systems is usually established on the premise that the system parameters are known [19, 28]. Compared with linear systems, nonlinear systems are more prevalent in engineering practice [6, 10]; they can be roughly divided into four categories: Hammerstein systems [36], Wiener systems [18], Hammerstein–Wiener systems and Wiener–Hammerstein systems [2, 35]. Recently, many identification algorithms have been developed for these nonlinear systems, such as the stochastic gradient (SG) algorithms [39], the expectation maximization algorithms and the iterative algorithms [4]. The SG algorithm updates the parameter estimates using only the latest input–output data at each sampling instant and does not need to compute an inverse matrix; thus, it has a low computational load. Its variants include the multi-innovation stochastic gradient algorithms [16, 42] and the gradient-based iterative algorithms [12].

The idea of the gradient-based identification algorithms is first to determine the search direction and then to calculate the step size at each sampling instant. Although the computational effort of the SG algorithm is small, its convergence rate is slow because of its zigzag search directions. In general, there are two ways to improve the convergence rate. One is to obtain the optimal direction at each sampling instant. For example, for control problems with undetermined final time, Hussu provided a conjugate-gradient method [20]. The other is to compute a suitable step size at each sampling instant. For instance, Chen and Ding introduced a convergence index into a modified stochastic gradient (M-SG) algorithm to improve the convergence rate [7]. Ma et al. [30] studied a forgetting factor stochastic gradient (FF-SG) algorithm for Hammerstein systems with saturation and preload nonlinearities. Although the M-SG and FF-SG algorithms can improve the convergence rates, they also bring some issues, such as severe oscillation when the parameter estimates approach the true values [24].

One may ask whether it is feasible to develop a modified SG algorithm which can not only estimate the parameters quickly, but also decrease the variances of the estimation errors. To this end, the Aitken method is introduced in this paper. The Aitken method is a sequence acceleration technique, and it is efficient for accelerating sequences that converge linearly. For example, Pavaloiu et al. [34] studied an Aitken–Newton iterative method for nonlinear equations, which is more competitive than some optimization methods of the same convergence order. Bumbariu [5] developed an improved Aitken acceleration method that computes the solutions of nonlinear equations with fast convergence rates. The approaches proposed in this paper have the following features.

  1. Using the key term separation method, which transforms the complex Hammerstein system with piecewise linearity into a simplified regression model.

  2. Studying an FF-SG algorithm for this nonlinear system, which improves the convergence rate.

  3. Developing an Aitken-based SG algorithm, which has a quick convergence rate and small estimation error variances.

  4. Extending the proposed methods to identify Hammerstein systems with colored noise.

The remainder of this paper is organized as follows. Section 2 introduces the Hammerstein model. Section 3 presents some SG algorithms. Section 4 studies the Aitken-based SG algorithm for the piecewise linear system with colored noise. In Sect. 5, two illustrative examples are provided. Section 6 gives the conclusions of this paper and directions for future research.

2 The Hammerstein System with Piecewise Linearity

The piecewise linear system is a special kind of switching system which widely exists in engineering practice [27, 37]. Such a system can be used to model, or approximately describe, processes with different gains in different input intervals, e.g., flight control, circuit and biological systems [26, 31].

Consider the Hammerstein system with piecewise linearity as follows:

$$\begin{aligned} A(\zeta )y(\tau )= & {} B(\zeta )f(q(\tau ))+v(\tau ), \end{aligned}$$
(1)

where \(q(\tau )\) is the input, taken as a persistent excitation signal sequence with zero mean and unit variance, \(y(\tau )\) is the output, \(v(\tau )\) is a white noise with zero mean and variance \(\sigma ^2\), and the piecewise linearity \(f(q(\tau ))\), shown in Fig. 1, can be written as

$$\begin{aligned} f(q(\tau ))= \left\{ \begin{array}{ll} m_1q(\tau ), &{} \quad q(\tau )\geqslant 0,\\ m_2q(\tau ), &{} \quad q(\tau )<0,\\ \end{array}\right. \end{aligned}$$
(2)
Fig. 1 The piecewise linearity

where the corresponding segment slopes are \(m_1\) and \(m_2\).

The polynomials \(A(\zeta )\) and \(B(\zeta )\) are expressed as

$$\begin{aligned} A(\zeta )= & {} 1+a_1\zeta ^{-1}+a_2\zeta ^{-2}+\cdots +a_n\zeta ^{-n},\\ B(\zeta )= & {} b_0+b_1\zeta ^{-1}+\cdots +b_{n-1}\zeta ^{1-n}. \end{aligned}$$

Since the piecewise linearity is expressed by two equations, the Hammerstein system may be described by two models [3]; the considered Hammerstein model is then equivalent to a switching model [1]. It is well known that the identification of switching models is more challenging. In order to simplify the identification process, the key term separation method is introduced [8, 9].

Define a switching function,

$$\begin{aligned} s(\tau ):=s[q(\tau )]=\left\{ \begin{array}{ll}0, &{}\quad q(\tau )>0, \\ 1, &{}\quad q(\tau )\le 0. \end{array}\right. \end{aligned}$$

Then, the nonlinear part \(f(q(\tau ))\) of the input is written as

$$\begin{aligned} f(q(\tau ))= & {} m_1s(-q(\tau ))q(\tau )+m_2s(q(\tau ))q(\tau ). \end{aligned}$$
(3)

The nonlinear model can be written as

$$\begin{aligned} A(\zeta )y(\tau )= & {} B(\zeta )m_1s(-q(\tau ))q(\tau )+B(\zeta )m_2s(q(\tau ))q(\tau )+v(\tau ). \end{aligned}$$
(4)

Define the information vector \(\varvec{{\chi }}(\tau )\) and the parameter vector \(\varvec{{\xi }}\) as

$$\begin{aligned} \varvec{{\chi }}(\tau )= & {} [-y(\tau -1), -y(\tau -2), \ldots ,-y(\tau -n), s(-q(\tau ))q(\tau ),s(-q(\tau -1))q(\tau -1),\ldots ,\nonumber \\&\ s(-q(\tau -n+1))q(\tau -n+1),s(q(\tau ))q(\tau ),s(q(\tau -1))q(\tau -1),\ldots ,\nonumber \\&\ s(q(\tau -n+1))q(\tau -n+1)]^{\mathrm{T}}\in {\mathbb R}^{3n}, \end{aligned}$$
(5)
$$\begin{aligned} \varvec{{\xi }}= & {} [a_1, a_2, \ldots , a_n, b_0m_1, b_1m_1,\ldots ,b_{n-1}m_1,b_0m_2,b_1m_2,\ldots ,b_{n-1}m_2]^{\mathrm{T}}\in {\mathbb R}^{3n}. \end{aligned}$$
(6)

Then, the nonlinear model can be simplified as a regression model:

$$\begin{aligned} y(\tau )=\varvec{{\chi }}^{\mathrm{T}}(\tau )\varvec{{\xi }}+v(\tau ). \end{aligned}$$
(7)

The algorithms proposed in this paper are based on this identification model. Many identification methods are derived from the identification models of dynamical systems [29, 32, 33]; they can be used to estimate the parameters of bilinear systems [23, 38, 47, 48] and can be applied to fields such as chemical process control. From Eq. (7), it can be seen that the parameters can be estimated by all the traditional identification algorithms, though at the cost of heavy computational demands [17].
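To make the construction of the regression model (7) concrete, the following Python sketch forms the information vector \(\varvec{{\chi }}(\tau )\) of (5) from recorded input–output data. It is a minimal illustration assuming zero initial conditions for \(\tau \leqslant 0\); the function name build_chi and the array layout are ours, not part of the original formulation.

```python
import numpy as np

def s(q):
    """Switching function: s(q) = 0 for q > 0 and s(q) = 1 for q <= 0."""
    return 0.0 if q > 0 else 1.0

def build_chi(y, q, tau, n):
    """Form the information vector chi(tau) in R^{3n} as in Eq. (5).

    y, q are sequences indexed from 0; values at negative times are zero.
    """
    def at(seq, t):
        return seq[t] if t >= 0 else 0.0   # zero initial conditions
    past_y = [-at(y, tau - i) for i in range(1, n + 1)]
    pos_part = [s(-at(q, tau - i)) * at(q, tau - i) for i in range(n)]  # m_1 block
    neg_part = [s(at(q, tau - i)) * at(q, tau - i) for i in range(n)]   # m_2 block
    return np.array(past_y + pos_part + neg_part)
```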

Remark 1

In this paper, \(b_0\) is assumed to be equal to 1; otherwise, \(b_i\) cannot be separated from \(b_im_k\). The parameter estimates then take the form

$$\begin{aligned} \hat{\varvec{{\xi }}}=[\hat{a}_1, \hat{a}_2, \ldots , \hat{a}_n, \hat{m}_1, \hat{b}_1\hat{m}_1,\ldots ,\hat{b}_{n-1}\hat{m}_1,\hat{m}_2,\hat{b}_1\hat{m}_2,\ldots ,\hat{b}_{n-1}\hat{m}_2]^{\mathrm{T}}. \end{aligned}$$

Once the parameter estimates have been obtained, we first extract \(\hat{m}_1\) and \(\hat{m}_2\); then, based on \(\hat{m}_1\) and \(\hat{m}_2\), we obtain \(\hat{b}_i=\frac{\hat{b}_i\hat{m}_k}{\hat{m}_k}\), \(i=1,\ldots ,n-1\), \(k=1,2\).
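As a worked illustration of Remark 1, the following sketch recovers \(\hat{a}_i\), \(\hat{m}_k\) and \(\hat{b}_i\) from an estimated \(\hat{\varvec{{\xi }}}\). Averaging the two quotients obtained for \(k=1,2\) is our own optional refinement, not prescribed here.

```python
import numpy as np

def recover_parameters(xi_hat, n):
    """Split xi_hat (layout of Eq. (6), with b0 = 1) into a_i, m_1, m_2, b_i."""
    a_hat = xi_hat[:n]
    m1_hat = xi_hat[n]          # the entry b0*m1 equals m1 since b0 = 1
    m2_hat = xi_hat[2 * n]      # the entry b0*m2 equals m2
    # b_i = (b_i m_k) / m_k; averaging over k = 1, 2 reduces noise slightly.
    b_hat = 0.5 * (xi_hat[n + 1:2 * n] / m1_hat + xi_hat[2 * n + 1:] / m2_hat)
    return a_hat, m1_hat, m2_hat, b_hat
```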

3 Some Stochastic Gradient Algorithms

The SG algorithm can be implemented online, updating the parameters according to the latest input–output data [11]. Therefore, it has a low computational cost. However, it also has a slow convergence rate. In this section, some modified SG algorithms are investigated.

3.1 The Traditional Stochastic Gradient Algorithm

Define the cost function

$$\begin{aligned} J(\varvec{{\xi }})=[y(\tau )-\varvec{{\chi }}^{\mathrm{T}}(\tau )\varvec{{\xi }}]^2. \end{aligned}$$

Assume that the parameter estimate at time \(\tau -1\) is \(\hat{\varvec{{\xi }}}(\tau -1)\); the key of the SG algorithm is to get a better estimate \(\hat{\varvec{{\xi }}}(\tau )\) which satisfies

$$\begin{aligned} J(\hat{\varvec{{\xi }}}(\tau ))=[y(\tau )-\varvec{{\chi }}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau )]^2 \leqslant J(\hat{\varvec{{\xi }}}(\tau -1))=[y(\tau )-\varvec{{\chi }}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1)]^2. \end{aligned}$$
(8)

\(\hat{\varvec{{\xi }}}(\tau )\) is obtained based on \(\hat{\varvec{{\xi }}}(\tau -1)\) and is written as

$$\begin{aligned} \hat{\varvec{{\xi }}}(\tau )=\hat{\varvec{{\xi }}}(\tau -1)+\lambda (\tau )\varvec{{\chi }}(\tau )[y(\tau ) -\varvec{{\chi }}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1)]. \end{aligned}$$
(9)

To ensure that (8) holds, substituting (9) into (8) gives

$$\begin{aligned} J(\hat{\varvec{{\xi }}}(\tau ))=\big \{y(\tau )-\varvec{{\chi }}^{\mathrm{T}}(\tau )(\hat{\varvec{{\xi }}}(\tau -1) +\lambda (\tau )\varvec{{\chi }}(\tau )[y(\tau )-\varvec{{\chi }}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1)])\big \}^2. \end{aligned}$$

Keeping \(\hat{\varvec{{\xi }}}(\tau -1)\) fixed, define \(J(\lambda (\tau ))\) as

$$\begin{aligned} J(\lambda (\tau ))=\big \{y(\tau )-\varvec{{\chi }}^{\mathrm{T}}(\tau )(\hat{\varvec{{\xi }}}(\tau -1)+\lambda (\tau )\varvec{{\chi }}(\tau )[y(\tau ) -\varvec{{\chi }}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1)])\big \}^2. \end{aligned}$$

Let

$$\begin{aligned} \frac{\partial J(\lambda (\tau ))}{\partial \lambda (\tau )}=\frac{\partial \big [y(\tau )-\varvec{{\chi }}^{\mathrm{T}}(\tau )(\hat{\varvec{{\xi }}}(\tau -1) +\lambda (\tau )\varvec{{\chi }}(\tau )[y(\tau )-\varvec{{\chi }}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1)])\big ]^2}{\partial \lambda (\tau )}. \end{aligned}$$

Since \(J(\lambda (\tau ))=[1-\lambda (\tau )\varvec{{\chi }}^{\mathrm{T}}(\tau )\varvec{{\chi }}(\tau )]^2[y(\tau )-\varvec{{\chi }}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1)]^2\), setting the above derivative equal to zero gives

$$\begin{aligned} \lambda (\tau )=\frac{1}{\varvec{{\chi }}^{\mathrm{T}}(\tau )\varvec{{\chi }}(\tau )}. \end{aligned}$$

Then, we can get the steepest descent algorithm

$$\begin{aligned} \hat{\varvec{{\xi }}}(\tau )= & {} \hat{\varvec{{\xi }}}(\tau -1)+\frac{\varvec{{\chi }}(\tau )}{\varvec{{\chi }}^{\mathrm{T}}(\tau )\varvec{{\chi }}(\tau )}\big (y(\tau )-{\varvec{{\chi }}}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1)\big ). \end{aligned}$$
(10)

However, when \(\varvec{{\chi }}^{\mathrm{T}}(\tau )\varvec{{\chi }}(\tau )\) is small, the correction term \(\frac{\varvec{{\chi }}(\tau )}{\varvec{{\chi }}^{\mathrm{T}}(\tau ) \varvec{{\chi }}(\tau )}\big (y(\tau )-{\varvec{{\chi }}}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1)\big )\) would be large, which may cause the steepest descent algorithm to diverge. With this in mind, we define

$$\begin{aligned} \lambda (\tau )=\rho +\varvec{{\chi }}^{\mathrm{T}}(\tau )\varvec{{\chi }}(\tau ),\quad \rho \geqslant 0. \end{aligned}$$

Then, we get the projection algorithm

$$\begin{aligned} \hat{\varvec{{\xi }}}(\tau )= & {} \hat{\varvec{{\xi }}}(\tau -1)+\frac{\varvec{{\chi }}(\tau )}{\rho +\varvec{{\chi }}^{\mathrm{T}}(\tau )\varvec{{\chi }}(\tau )}\big (y(\tau )-{\varvec{{\chi }}}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1)\big ). \end{aligned}$$
(11)

Since \(\rho \) is a constant, the unchanged step size makes the estimates oscillate seriously when they approach the true values. In order to solve this problem, we replace \(\rho \) by \(\lambda (\tau -1)\). Then, the SG algorithm for estimating the parameter vector \(\varvec{{\xi }}\) is listed as follows,

$$\begin{aligned} \hat{\varvec{{\xi }}}(\tau )= & {} \hat{\varvec{{\xi }}}(\tau -1)+\frac{\varvec{{\chi }}(\tau )}{\lambda (\tau )}\big (y(\tau )-{\varvec{{\chi }}}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1)\big ), \end{aligned}$$
(12)
$$\begin{aligned} {\varvec{{\chi }}}(\tau )= & {} [-y(\tau -1), -y(\tau -2), \ldots ,-y(\tau -n), s(-q(\tau ))q(\tau ),s(-q(\tau -1))q(\tau -1),\ldots ,\nonumber \\&\ s(-q(\tau -n+1))q(\tau -n+1),s(q(\tau ))q(\tau ),s(q(\tau -1))q(\tau -1),\ldots ,\nonumber \\&\ s(q(\tau -n+1))q(\tau -n+1)]^{\mathrm{T}}. \end{aligned}$$
(13)
$$\begin{aligned} \lambda (\tau )= & {} \lambda (\tau -1)+\Vert {\varvec{{\chi }}}(\tau )\Vert ^{2},\quad \lambda (0)=1. \end{aligned}$$
(14)
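A minimal sketch of the SG recursion (12)–(14) is given below. It reuses build_chi from the sketch in Sect. 2; the zero initial estimate and the return of the whole estimate history are our illustrative choices.

```python
import numpy as np

def sg_estimate(y, q, n, T):
    """Traditional SG algorithm (12)-(14) for the model y = chi^T xi + v."""
    xi_hat = np.zeros(3 * n)      # illustrative initial estimate
    lam = 1.0                     # lambda(0) = 1
    history = []
    for tau in range(T):
        chi = build_chi(y, q, tau, n)          # information vector, Eq. (13)
        lam = lam + chi @ chi                  # step-size recursion, Eq. (14)
        innovation = y[tau] - chi @ xi_hat
        xi_hat = xi_hat + (chi / lam) * innovation   # update, Eq. (12)
        history.append(xi_hat.copy())
    return np.array(history)
```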

Remark 2

Although the traditional SG algorithm has a low computational cost, it also brings some challenging issues, e.g., a slow convergence rate, especially for systems with a large number of unknown parameters.

3.2 Two Modified Stochastic Gradient Algorithms

In order to increase the convergence rate, two modified SG algorithms for the Hammerstein system are developed in this subsection. A forgetting factor SG (FF-SG) algorithm is introduced first,

$$\begin{aligned} \hat{\varvec{{\xi }}}(\tau )= & {} \hat{\varvec{{\xi }}}(\tau -1)+\frac{\varvec{{\chi }}(\tau )}{\lambda (\tau )}\big (y(\tau )-{\varvec{{\chi }}}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1)\big ), \end{aligned}$$
(15)
$$\begin{aligned} {\varvec{{\chi }}}(\tau )= & {} [-y(\tau -1), -y(\tau -2), \ldots ,-y(\tau -n), s(-q(\tau ))q(\tau ),s(-q(\tau -1))q(\tau -1),\ldots ,\nonumber \\&\ s(-q(\tau -n+1))q(\tau -n+1),s(q(\tau ))q(\tau ),s(q(\tau -1))q(\tau -1),\ldots ,\nonumber \\&\ s(q(\tau -n+1))q(\tau -n+1)]^{\mathrm{T}}. \end{aligned}$$
(16)
$$\begin{aligned} \lambda (\tau )= & {} r \lambda (\tau -1)+\Vert {\varvec{{\chi }}}(\tau )\Vert ^{2},\ \lambda (0)=1, \ 0.8 \leqslant r \leqslant 1. \end{aligned}$$
(17)
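Relative to the SG sketch above, only the step-size recursion changes. A minimal FF-SG sketch, again reusing build_chi and with an illustrative choice r = 0.95 inside the stated interval, is as follows.

```python
import numpy as np

def ffsg_estimate(y, q, n, T, r=0.95):
    """FF-SG algorithm (15)-(17): SG with a forgetting factor 0.8 <= r <= 1."""
    xi_hat = np.zeros(3 * n)
    lam = 1.0
    history = []
    for tau in range(T):
        chi = build_chi(y, q, tau, n)
        lam = r * lam + chi @ chi   # Eq. (17): r < 1 keeps lambda smaller,
                                    # hence a larger step size than plain SG
        xi_hat = xi_hat + (chi / lam) * (y[tau] - chi @ xi_hat)
        history.append(xi_hat.copy())
    return np.array(history)
```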

Remark 3

The FF-SG algorithm introduces a forgetting factor r into the step-size recursion [15, 43, 46], which keeps \(\lambda (\tau )\) smaller and thus enlarges the step size at each sampling instant. Therefore, the FF-SG algorithm converges faster than the traditional SG algorithm.

Remark 4

Although the FF-SG algorithm can increase the convergence rate, it brings some challenges, such as large estimation error variances.

To make the variance of the estimation error smaller, another modified SG algorithm is studied in the following, termed the Aitken-based SG (A-SG) algorithm. Assume that the parameter estimate \(\hat{\varvec{{\xi }}}(\tau )\) converges to the true value \(\varvec{{\xi }}\), which means that

$$\begin{aligned} \lim _{\tau \rightarrow \infty }[\hat{\varvec{{\xi }}}(\tau )-{\varvec{{\xi }}}] =\lim _{\tau \rightarrow \infty }[\hat{\varvec{{\xi }}}(\tau -1)-{\varvec{{\xi }}}]. \end{aligned}$$
(18)

It is equivalent to

$$\begin{aligned} \lim _{\tau \rightarrow \infty }\frac{\hat{\xi }_{\varrho }(\tau )- {\xi }_{\varrho }}{\hat{\xi }_{\varrho }(\tau -1)- {\xi }_{\varrho }}={1}, \end{aligned}$$
(19)

where \({\xi }_{\varrho }\) is the \(\varrho \)th element of the parameter vector \({\varvec{{\xi }}}\), \(\varrho =1,2,\ldots ,3n\). When \(\tau \) is large enough, (19) can be written approximately as

$$\begin{aligned} \frac{\hat{\xi }_{\varrho }(\tau )-{\xi }_{\varrho }}{\hat{\xi }_{\varrho }(\tau -1)-{\xi }_{\varrho }} =\frac{\hat{\xi }_{\varrho }(\tau -1)-{\xi }_{\varrho }}{\hat{\xi }_{\varrho }(\tau -2)-{\xi }_{\varrho }}. \end{aligned}$$
(20)

From (19) and (20), it follows that

$$\begin{aligned} {[}\hat{\xi }_{\varrho }(\tau )+\hat{\xi }_{\varrho }(\tau -2)-2\hat{\xi }_{\varrho }(\tau -1)]{\xi }_{\varrho } = \hat{\xi }_{\varrho }(\tau )\hat{\xi }_{\varrho }(\tau -2)-\hat{\xi }^2_{\varrho }(\tau -1). \end{aligned}$$
(21)

Then, the Aitken accelerated iteration formula for \({\xi }_{\varrho }\) can be written as

$$\begin{aligned} {\xi }_{\varrho }=\hat{\xi }_{\varrho }(\tau -2) -\frac{\big (\hat{\xi }_{\varrho }(\tau -1)-\hat{\xi }_{\varrho }(\tau -2)\big )^2}{\hat{\xi }_{\varrho }(\tau ) +\hat{\xi }_{\varrho }(\tau -2)-2\hat{\xi }_{\varrho }(\tau -1)}. \end{aligned}$$
(22)
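To see why (22) accelerates a linearly convergent sequence, consider the following self-contained sketch. The sequence \(x_k=0.5+0.9^k\), with limit 0.5, is an illustrative example of ours, and the small-denominator guard is our own safeguard.

```python
def aitken(x0, x1, x2):
    """Aitken extrapolation (22) from three consecutive iterates."""
    denom = x2 + x0 - 2.0 * x1
    if abs(denom) < 1e-12:        # sequence has (numerically) converged already
        return x2
    return x0 - (x1 - x0) ** 2 / denom

xs = [0.5 + 0.9 ** k for k in range(12)]
print(xs[-1])                          # raw iterate: 0.8138..., error about 0.31
print(aitken(xs[-3], xs[-2], xs[-1]))  # extrapolated value: 0.5, exact here,
                                       # since the error decays geometrically
```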

However, the parameter vector \(\varvec{{\xi }}\) cannot be computed directly from Eq. (22), because (22) applies to a scalar rather than a vector. In order to apply it componentwise, the parameter vector \(\varvec{{\xi }}\) is rewritten as

$$\begin{aligned} \varvec{{\xi }}=[a_1, a_2, \ldots , a_n, m_1, b_1m_1,\ldots ,b_{n-1}m_1,m_2,b_1m_2,\ldots ,b_{n-1}m_2]^{\mathrm{T}}. \end{aligned}$$

Then, Eq. (21) is equivalent to the following 3n equations,

$$\begin{aligned}&[\hat{a}_{i}(\tau )+\hat{a}_{i}(\tau -2)-2\hat{a}_{i}(\tau -1)]{a}_{i}= \hat{a}_{i}(\tau )\hat{a}_{i}(\tau -2)-\hat{a}^2_{i}(\tau -1), \quad i=1,\ldots ,n,\\&[\hat{m}_k(\tau )+\hat{m}_k(\tau -2)-2\hat{m}_k(\tau -1)]m_k= \hat{m}_k(\tau )\hat{m}_k(\tau -2)-\hat{m}^2_k(\tau -1), \quad k=1,2, \\&[\hat{b}_{j}(\tau )\hat{m}_k(\tau )+\hat{b}_{j}(\tau -2)\hat{m}_k(\tau -2)-2\hat{b}_{j}(\tau -1)\hat{m}_k(\tau -1)]b_j m_k\\&\quad =\hat{b}_{j}(\tau )\hat{m}_k(\tau )\hat{b}_{j}(\tau -2)\hat{m}_k(\tau -2)-\hat{b}^2_{j}(\tau -1)\hat{m}^2_k(\tau -1), \quad j=1,\ldots ,n-1. \end{aligned}$$

Then, we have

$$\begin{aligned} {a}_{i}= & {} \hat{a}_{i}(\tau -2)-\frac{(\hat{a}_{i}(\tau -1)-\hat{a}_{i}(\tau -2))^2}{\hat{a}_{i}(\tau )+\hat{a}_{i}(\tau -2)-2\hat{a}_{i}(\tau -1)},\\ {m}_{k}= & {} \hat{m}_{k}(\tau -2)-\frac{(\hat{m}_{k}(\tau -1)-\hat{m}_{k}(\tau -2))^2}{\hat{m}_{k}(\tau )+\hat{m}_{k}(\tau -2)-2\hat{m}_{k}(\tau -1)},\\ b_j m_k= & {} \hat{b}_{j}(\tau -2)\hat{m}_k(\tau -2)\\&-\frac{(\hat{b}_{j}(\tau -1)\hat{m}_k(\tau -1)-\hat{b}_{j}(\tau -2)\hat{m}_k(\tau -2))^2}{\hat{b}_{j}(\tau )\hat{m}_k(\tau )+\hat{b}_{j}(\tau -2)\hat{m}_k(\tau -2)-2\hat{b}_{j}(\tau -1)\hat{m}_k(\tau -1)}. \end{aligned}$$

Define

$$\begin{aligned} \bar{a}_{i}(\tau )= & {} \hat{a}_{i}(\tau -2)-\frac{(\hat{a}_{i}(\tau -1)- \hat{a}_{i}(\tau -2))^2}{\hat{a}_{i}(\tau )+\hat{a}_{i}(\tau -2)-2\hat{a}_{i}(\tau -1)},\\ \bar{m}_{k}(\tau )= & {} \hat{m}_{k}(\tau -2)-\frac{(\hat{m}_{k}(\tau -1) -\hat{m}_{k}(\tau -2))^2}{\hat{m}_{k}(\tau )+\hat{m}_{k}(\tau -2)-2\hat{m}_{k} (\tau -1)},\\ \bar{b}_j(\tau ) \bar{m}_k(\tau )= & {} \hat{b}_{j}(\tau -2)\hat{m}_k(\tau -2)\\&-\frac{(\hat{b}_{j}(\tau -1)\hat{m}_k(\tau -1)-\hat{b}_{j}(\tau -2) \hat{m}_k(\tau -2))^2}{\hat{b}_{j}(\tau )\hat{m}_k(\tau )+\hat{b}_{j}(\tau -2) \hat{m}_k(\tau -2)-2\hat{b}_{j}(\tau -1)\hat{m}_k(\tau -1)}. \end{aligned}$$

The Aitken-based SG algorithm is obtained as follows,

$$\begin{aligned} \bar{a}_{i}(\tau )= & {} \hat{a}_{i}(\tau -2)-\frac{(\hat{a}_{i}(\tau -1) -\hat{a}_{i}(\tau -2))^2}{\hat{a}_{i}(\tau )+\hat{a}_{i}(\tau -2) -2\hat{a}_{i}(\tau -1)}, \end{aligned}$$
(23)
$$\begin{aligned} \bar{m}_{k}(\tau )= & {} \hat{m}_{k}(\tau -2) -\frac{(\hat{m}_{k}(\tau -1)-\hat{m}_{k}(\tau -2))^2}{\hat{m}_{k}(\tau ) +\hat{m}_{k}(\tau -2)-2\hat{m}_{k}(\tau -1)}, \end{aligned}$$
(24)
$$\begin{aligned} \bar{b}_j(\tau )\bar{m}_k(\tau )= & {} \hat{b}_{j}(\tau -2)\hat{m}_k(\tau -2)\nonumber \\&-\frac{(\hat{b}_{j}(\tau -1)\hat{m}_k(\tau -1)-\hat{b}_{j}(\tau -2) \hat{m}_k(\tau -2))^2}{\hat{b}_{j}(\tau )\hat{m}_k(\tau )+\hat{b}_{j}(\tau -2)\hat{m}_k(\tau -2) -2\hat{b}_{j}(\tau -1)\hat{m}_k(\tau -1)},\nonumber \\ \end{aligned}$$
(25)
$$\begin{aligned} \hat{\varvec{{\xi }}}(\tau )= & {} \hat{\varvec{{\xi }}}(\tau -1) +\frac{\varvec{{\chi }}(\tau )}{\lambda (\tau )}e(\tau ),\ \end{aligned}$$
(26)
$$\begin{aligned} e(\tau )= & {} y(\tau )-\varvec{{\chi }}^{\mathrm{T}}(\tau )\hat{\varvec{{\xi }}}(\tau -1), \end{aligned}$$
(27)
$$\begin{aligned} {\varvec{{\chi }}}(\tau )= & {} [-y(\tau -1), -y(\tau -2), \ldots ,-y(\tau -n), s(-q(\tau ))q(\tau ),\nonumber \\&s(-q(\tau -1))q(\tau -1),\ldots ,\nonumber \\&\ s(-q(\tau -n+1))q(\tau -n+1),s(q(\tau ))q(\tau ), s(q(\tau -1))q(\tau -1),\ldots ,\nonumber \\&\ s(q(\tau -n+1))q(\tau -n+1)]^{\mathrm{T}}, \end{aligned}$$
(28)
$$\begin{aligned} \lambda (\tau )= & {} \lambda (\tau -1)+\varvec{{\chi }}^{\mathrm{T}}(\tau )\varvec{{\chi }}(\tau ). \end{aligned}$$
(29)

The steps of the A-SG algorithm are listed as follows; a minimal implementation sketch is given after the list.

  1. To initialize: let \(\tau =1\), \(\hat{\varvec{{\xi }}}(0)=\mathbf{1}_{3n}/p_0\), \(p_0=10^6\) and \(\lambda (0)=1\).

  2. Let \(y(\tau )=0\), \(q(\tau )=0\) for \(\tau \leqslant 0\), and give an error tolerance \(\varepsilon \).

  3. Collect the input–output data \(\{q(\tau ), y(\tau )\}\).

  4. Form \({\varvec{{\chi }}}(\tau )\) by (28).

  5. Compute \(e(\tau )\) and \(\lambda (\tau )\) by (27) and (29), respectively.

  6. Update the estimation vector \(\hat{\varvec{{\xi }}}(\tau )\) by (26).

  7. Compute each estimate \(\bar{a}_{i}(\tau ), i=1,\ldots , n\), \(\bar{m}_{k}(\tau ), k=1,2\) and \(\bar{b}_j(\tau )\bar{m}_k(\tau ), j=1, \ldots , n-1\) by (23)–(25), and then form \(\bar{\varvec{{\xi }}}(\tau )\).

  8. Compare \(\bar{\varvec{{\xi }}}(\tau )\) and \(\bar{\varvec{{\xi }}}(\tau -1)\): if \(\Vert \bar{\varvec{{\xi }}}(\tau )-\bar{\varvec{{\xi }}}(\tau -1)\Vert \leqslant \varepsilon \), obtain \(\bar{\varvec{{\xi }}}(\tau )\) and go to the next step; otherwise, increase \(\tau \) by 1 and go to step 3.

  9. Compute \(\bar{m}_k(\tau )\) first, and then calculate \(\bar{b}_i(\tau )=\frac{\bar{b}_i(\tau )\bar{m}_k(\tau )}{\bar{m}_k(\tau )}\).
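The following sketch assembles steps 1–8 into code. It reuses build_chi and aitken from the earlier sketches; as in (25), the products \(b_jm_k\) are extrapolated directly as single entries, and the recovery of step 9 can then be done with recover_parameters from Sect. 2. The initialization follows step 1; everything else (names, stopping details) is illustrative.

```python
import numpy as np

def asg_estimate(y, q, n, T, eps=1e-6):
    """A-SG algorithm (23)-(29), with the stopping rule of step 8."""
    xi_hat = np.full(3 * n, 1e-6)   # step 1: xi_hat(0) = 1_{3n}/p0, p0 = 10^6
    lam = 1.0                       # step 1: lambda(0) = 1
    prev2 = xi_hat.copy()           # xi_hat(tau - 2)
    prev1 = xi_hat.copy()           # xi_hat(tau - 1)
    xi_bar_old = xi_hat.copy()
    for tau in range(T):
        chi = build_chi(y, q, tau, n)            # step 4, Eq. (28)
        lam = lam + chi @ chi                    # step 5, Eq. (29)
        e = y[tau] - chi @ xi_hat                # step 5, Eq. (27)
        prev2, prev1 = prev1, xi_hat.copy()
        xi_hat = xi_hat + (chi / lam) * e        # step 6, Eq. (26)
        # step 7: componentwise Aitken extrapolation, Eqs. (23)-(25);
        # the entries holding b_j m_k are extrapolated as products directly.
        xi_bar = np.array([aitken(prev2[j], prev1[j], xi_hat[j])
                           for j in range(3 * n)])
        # step 8: stop once consecutive extrapolated estimates agree
        if tau > 2 and np.linalg.norm(xi_bar - xi_bar_old) <= eps:
            break
        xi_bar_old = xi_bar
    return xi_bar
```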

Remark 5

The A-SG algorithm utilizes three consecutive parameter estimates to obtain an improved parameter estimate, rather than using a large step size to speed up the convergence. Therefore, the A-SG algorithm achieves a quick convergence rate together with small estimation error variances.

4 The Identification of the Hammerstein Piecewise Linear System with Colored Noise

In this section, the SG algorithms are developed to identify the Hammerstein system with colored noise, whose information vector contains unmeasurable noise variables.

4.1 Problem Description and Identification Model

Consider the Hammerstein piecewise linear system with colored noise as follows,

$$\begin{aligned} y(\tau )=\frac{B(\zeta )}{A(\zeta )}\bar{x}(\tau )+\frac{D(\zeta )}{A(\zeta )}v(\tau ), \end{aligned}$$
(30)

where \(D(\zeta ):=1+d_1\zeta ^{-1}+d_2\zeta ^{-2}+\cdots +d_{n_d}\zeta ^{-n_d}\), and the definitions of \(A(\zeta )\), \(B(\zeta )\) and the piecewise linearity are the same as those in Sect. 2.

By utilizing the key term separation technique, the system can be transformed into

$$\begin{aligned} A(\zeta )y(\tau )= & {} B(\zeta )m_1s(-q(\tau ))q(\tau ) +B(\zeta )m_2s(q(\tau ))q(\tau )+D(\zeta )v(\tau ). \end{aligned}$$
(31)

Then, the system is written as

$$\begin{aligned} y(\tau )= & {} -\sum \limits _{i=1}^{n}{{a_i}y(\tau -i)}+m_1s(-q(\tau ))q(\tau ) +\sum \limits _{i=1}^{n-1}{m_1b_is(-q(\tau -i))q(\tau -i)}\nonumber \\&+m_2s(q(\tau ))q(\tau )+\sum \limits _{i=1}^{n-1}{m_2b_is(q(\tau -i))q(\tau -i)} -\sum \limits _{i=1}^{n_d}{{d_i}v(\tau -i)}+v(\tau ).\nonumber \\ \end{aligned}$$
(32)

Define the information vector \(\varvec{{\psi }}(\tau )\) and the parameter vector \(\varvec{{\vartheta }}\) as

$$\begin{aligned} \varvec{{\psi }}(\tau ):= & {} [-y(\tau -1), -y(\tau -2), \ldots ,-y(\tau -n), s(-q(\tau ))q(\tau ),\nonumber \\&\ s(-q(\tau -1))q(\tau -1),\ldots ,s(-q(\tau -n+1))q(\tau -n+1),s(q(\tau ))q(\tau ),\nonumber \\&\ s(q(\tau -1))q(\tau -1),\ldots ,s(q(\tau -n+1))q(\tau -n+1),\nonumber \\&v(\tau -1),v(\tau -2),\ldots ,v(\tau -n_d)]^{\mathrm{T}}\in {\mathbb R}^{3n+n_d}, \end{aligned}$$
(33)
$$\begin{aligned} \varvec{{\vartheta }}:= & {} [a_1, a_2, \ldots , a_n, m_1, b_1m_1,\ldots ,b_{n-1}m_1,m_2,b_1m_2,\ldots ,b_{n-1}m_2,\nonumber \\&d_1, d_2, \ldots , d_{n_d}]^{\mathrm{T}}\in {\mathbb R}^{3n+n_d}. \end{aligned}$$
(34)

Then, the nonlinear system can be expressed as a simple form,

$$\begin{aligned} y(\tau )=\varvec{{\psi }}^{\mathrm{T}}(\tau )\varvec{{\vartheta }}+v(\tau ). \end{aligned}$$
(35)

4.2 The Aitken Stochastic Gradient Algorithm

Since the information vector of the Hammerstein piecewise linear system with colored noise contains the unmeasured noise variables \(v(\tau -i)\), we denote by \(\hat{v}(\tau )\) and \(\hat{\varvec{{\psi }}}(\tau )\) the estimates of \(v(\tau )\) and \(\varvec{{\psi }}(\tau )\) at time \(\tau \), respectively. Let \(\hat{\varvec{{\vartheta }}}(\tau )\) be the estimate of \(\varvec{{\vartheta }}\) at time \(\tau \) and define the innovation \(e(\tau )\) at time \(\tau \) as follows,

$$\begin{aligned} e(\tau ):=y(\tau )-{\hat{\varvec{{\psi }}}}^{\mathrm{T}}(\tau ){\hat{\varvec{{\vartheta }}}(\tau -1)}, \end{aligned}$$
(36)

where

$$\begin{aligned} \hat{\varvec{{\psi }}}(\tau )= & {} [-y(\tau -1), -y(\tau -2), \ldots , -y(\tau -n), s(-q(\tau ))q(\tau ),\nonumber \\&s(-q(\tau -1))q(\tau -1),\ldots , s(-q(\tau -n+1))q(\tau -n+1),s(q(\tau ))q(\tau ),\nonumber \\&s(q(\tau -1))q(\tau -1),\ldots , s(q(\tau -n+1))q(\tau -n+1),\nonumber \\&e(\tau -1),e(\tau -2),\ldots ,e(\tau -n_d)]^{\mathrm{T}}\in {\mathbb R}^{3n+n_d}. \end{aligned}$$
(37)

Remark 6

Since the information vector \(\varvec{{\psi }}(\tau )\) contains the unmeasurable variables \(v(\tau -i)\), their estimates \(e(\tau -i)\) can be used to replace these unknown noise variables \(v(\tau -i)\) in the information vector.

By using the Aitken accelerated iteration technique, the Aitken SG (A-SG) algorithm for the Hammerstein system with colored noise is developed as follows,

$$\begin{aligned} \bar{a}_{i}(\tau )= & {} \hat{a}_{i}(\tau -2)-\frac{(\hat{a}_{i}(\tau -1) -\hat{a}_{i}(\tau -2))^2}{\hat{a}_{i}(\tau )+\hat{a}_{i}(\tau -2) -2\hat{a}_{i}(\tau -1)}, \end{aligned}$$
(38)
$$\begin{aligned} \bar{m}_{k}(\tau )= & {} \hat{m}_{k}(\tau -2)-\frac{(\hat{m}_{k}(\tau -1) -\hat{m}_{k}(\tau -2))^2}{\hat{m}_{k}(\tau )+\hat{m}_{k}(\tau -2) -2\hat{m}_{k}(\tau -1)}, \end{aligned}$$
(39)
$$\begin{aligned} \bar{b}_j(\tau )\bar{m}_k(\tau )= & {} \hat{b}_{j}(\tau -2)\hat{m}_k(\tau -2)\nonumber \\&-\frac{(\hat{b}_{j}(\tau -1)\hat{m}_k(\tau -1) -\hat{b}_{j}(\tau -2)\hat{m}_k(\tau -2))^2}{\hat{b}_{j}(\tau )\hat{m}_k(\tau )+\hat{b}_{j}(\tau -2)\hat{m}_k(\tau -2) -2\hat{b}_{j}(\tau -1)\hat{m}_k(\tau -1)}, \end{aligned}$$
(40)
$$\begin{aligned} \bar{d}_{i}(\tau )= & {} \hat{d}_{i}(\tau -2)-\frac{(\hat{d}_{i}(\tau -1) -\hat{d}_{i}(\tau -2))^2}{\hat{d}_{i}(\tau )+\hat{d}_{i}(\tau -2) -2\hat{d}_{i}(\tau -1)}, \end{aligned}$$
(41)
$$\begin{aligned} \bar{\varvec{{\vartheta }}}(\tau )= & {} [\bar{a}_1(\tau ), \bar{a}_2(\tau ), \ldots , \bar{a}_n(\tau ), \bar{m}_1(\tau ), \bar{b}_1(\tau )\bar{m}_1(\tau ),\ldots , \bar{b}_{n-1}(\tau )\bar{m}_1(\tau ),\nonumber \\&\bar{m}_2(\tau ),\bar{b}_1(\tau )\bar{m}_2(\tau ),\ldots , \bar{b}_{n-1}(\tau )\bar{m}_2(\tau ),\bar{d}_1(\tau ),\bar{d}_2(\tau ),\ldots ,\bar{d}_{n_d}(\tau )]^{\mathrm{T}}, \end{aligned}$$
(42)
$$\begin{aligned} \hat{\varvec{{\vartheta }}}(\tau )= & {} \hat{\varvec{{\vartheta }}}(\tau -1) +\frac{\hat{\varvec{{\psi }}}(\tau )}{\lambda (\tau )}e(\tau ), \end{aligned}$$
(43)
$$\begin{aligned} e(\tau )= & {} y(\tau )-\hat{\varvec{{\psi }}}^{\mathrm{T}}(\tau )\hat{\varvec{{\vartheta }}}(\tau -1), \end{aligned}$$
(44)
$$\begin{aligned} \hat{\varvec{{\psi }}}(\tau )= & {} [-y(\tau -1), -y(\tau -2), \ldots ,-y(\tau -n), s(-q(\tau ))q(\tau ),\nonumber \\&\ s(-q(\tau -1))q(\tau -1),\ldots , s(-q(\tau -n+1))q(\tau -n+1),\nonumber \\&\ s(q(\tau ))q(\tau ),s(q(\tau -1))q(\tau -1), \ldots ,\nonumber \\&s(q(\tau -n+1))q(\tau -n+1),e(\tau -1),e(\tau -2),\ldots ,e(\tau -n_d)]^{\mathrm{T}}, \end{aligned}$$
(45)
$$\begin{aligned} \lambda (\tau )= & {} \lambda (\tau -1)+\hat{\varvec{{\psi }}}^{\mathrm{T}}(\tau )\hat{\varvec{{\psi }}}(\tau ). \end{aligned}$$
(46)
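As a sketch of how the unmeasured noise terms are handled in (44)–(45), the following function forms \(\hat{\varvec{{\psi }}}(\tau )\) from the measured data and the stored innovations \(e(\tau -i)\). Here e_hist is assumed to be a list appended with \(e(\tau )\) after each update, and zero initial conditions are used for \(\tau \leqslant 0\); the function name and layout are ours.

```python
import numpy as np

def build_psi_hat(y, q, e_hist, tau, n, nd):
    """Form psi_hat(tau) in R^{3n+nd} as in Eq. (45)."""
    def at(seq, t):
        return seq[t] if t >= 0 else 0.0   # zero initial conditions
    s = lambda x: 0.0 if x > 0 else 1.0    # switching function
    past_y = [-at(y, tau - i) for i in range(1, n + 1)]
    pos_part = [s(-at(q, tau - i)) * at(q, tau - i) for i in range(n)]
    neg_part = [s(at(q, tau - i)) * at(q, tau - i) for i in range(n)]
    past_e = [at(e_hist, tau - i) for i in range(1, nd + 1)]  # replaces v(tau - i)
    return np.array(past_y + pos_part + neg_part + past_e)
```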

The flowchart of the A-SG algorithm is presented in Fig. 2. The methods proposed in this paper can be combined with other identification methods [40, 41] to study the parameter estimation problems of different systems with colored noise, such as nonlinear systems [21, 22], and can be applied to other areas such as signal modeling and networked communication systems.

Fig. 2 The flowchart of the A-SG algorithm for computing \(\bar{\varvec{{\vartheta }}}(\tau )\)

5 Numerical Examples

Example 1

Consider the following Hammerstein model,

$$\begin{aligned} A(\zeta )y(\tau )= & {} B(\zeta )f(q(\tau ))+v(\tau ),\\ y(\tau )= & {} -a_1y(\tau -1)-a_2y(\tau -2)+f(q(\tau ))+b_1f(q(\tau -1))+v(\tau )\\= & {} -0.15y(\tau -1)-0.46y(\tau -2)+f(q(\tau ))+0.9f(q(\tau -1))+v(\tau ),\\ f(q(\tau ))= & {} \bigg \{\begin{array}{ll} 0.3q(\tau ), &{} \quad q(\tau )\geqslant 0,\\ 0.2q(\tau ), &{}\quad q(\tau )<0,\\ \end{array} \\ \varvec{{\xi }}= & {} [a_1,a_2,m_1,b_1m_1,m_2,b_1m_2]^{\mathrm{T}}=[0.15,0.46,0.3,0.27,0.2,0.18]^{\mathrm{T}},\\ \varvec{{\chi }}(\tau )= & {} [-y(\tau -1),-y(\tau -2),s(-q(\tau ))q(\tau ),s(-q(\tau -1))q(\tau -1),s(q(\tau ))q(\tau ),\\&s(q(\tau -1))q(\tau -1)]^{\mathrm{T}}, \end{aligned}$$

where \(\{v(\tau )\}\) is taken as a white noise sequence with zero mean and variance \(\sigma ^2=0.10^2\), and \(\{q(\tau )\}\) is an input sequence with zero mean and unit variance.
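A sketch of generating the data of Example 1 under the stated assumptions is given below; the horizon T and the random seed are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 3000
q = rng.standard_normal(T)                 # input: zero mean, unit variance
v = 0.10 * rng.standard_normal(T)          # white noise, sigma = 0.10
f = np.where(q >= 0, 0.3 * q, 0.2 * q)     # piecewise linearity of Example 1

y = np.zeros(T)
for tau in range(T):
    y1 = y[tau - 1] if tau >= 1 else 0.0   # zero initial conditions
    y2 = y[tau - 2] if tau >= 2 else 0.0
    f1 = f[tau - 1] if tau >= 1 else 0.0
    y[tau] = -0.15 * y1 - 0.46 * y2 + f[tau] + 0.9 * f1 + v[tau]

# q and y can now be fed to the SG, FF-SG and A-SG sketches with n = 2.
```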

The SG, the FF-SG and the A-SG algorithms are applied to estimate the parameters of the piecewise linear system. The estimation errors \(\delta :=\Vert \hat{\varvec{{\xi }}}-\varvec{{\xi }}\Vert /\Vert \varvec{{\xi }}\Vert \) or \(\delta :=\Vert \bar{\varvec{{\xi }}}-\varvec{{\xi }}\Vert /\Vert \varvec{{\xi }}\Vert \) versus \(\tau \) are shown in Fig. 3 and Tables 1, 2 and 3. The means and variances of the estimation errors of these three algorithms are given in Table 4.

Fig. 3 The SG, FF-SG and A-SG estimation errors \(\delta \) versus \(\tau \) of Example 1

Example 2

Consider the following Hammerstein model with colored noise,

$$\begin{aligned} y(\tau )= & {} \frac{B(\zeta )}{A(\zeta )}\bar{x}(\tau ) +\frac{D(\zeta )}{A(\zeta )}v(\tau ),\\ y(\tau )= & {} -a_1y(\tau -1)-a_2y(\tau -2)+f(q(\tau ))\\&+\,b_1f(q(\tau -1))+v(\tau )+d_1v(\tau -1)\\= & {} -0.21y(\tau -1)-0.10y(\tau -2)+f(q(\tau ))\\&+\,0.5f(q(\tau -1))+v(\tau )-0.38v(\tau -1),\\ f(q(\tau ))= & {} \bigg \{\begin{array}{ll} 2.0q(\tau ), &{} \quad q(\tau )\geqslant 0,\\ 1.4q(\tau ), &{}\quad q(\tau )<0,\\ \end{array}\\ \varvec{{\vartheta }}= & {} [a_1,a_2,m_1,b_1m_1,m_2,b_1m_2,d_1]^{\mathrm{T}}=[0.21,0.1,2.0,1.0,1.4,0.7,-0.38]^{\mathrm{T}},\\ \varvec{{\psi }}(\tau )= & {} [-y(\tau -1),-y(\tau -2),s(-q(\tau ))q(\tau ),s(-q(\tau -1))q(\tau -1),\\&s(q(\tau ))q(\tau ),s(q(\tau -1))q(\tau -1),v(\tau -1)]^{\mathrm{T}}, \end{aligned}$$

where \(\{v(\tau )\}\) is taken as a white noise sequence with zero mean and variance \(\sigma ^2=0.10^2\), and \(\{q(\tau )\}\) is an input sequence with zero mean and unit variance.
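A sketch of generating the colored-noise data of Example 2 is given below; as before, the horizon and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 3000
q = rng.standard_normal(T)
v = 0.10 * rng.standard_normal(T)
f = np.where(q >= 0, 2.0 * q, 1.4 * q)     # piecewise linearity of Example 2

y = np.zeros(T)
for tau in range(T):
    y1 = y[tau - 1] if tau >= 1 else 0.0   # zero initial conditions
    y2 = y[tau - 2] if tau >= 2 else 0.0
    f1 = f[tau - 1] if tau >= 1 else 0.0
    v1 = v[tau - 1] if tau >= 1 else 0.0
    y[tau] = -0.21 * y1 - 0.10 * y2 + f[tau] + 0.5 * f1 + v[tau] - 0.38 * v1

# psi_hat is then formed with build_psi_hat, appending each innovation e(tau).
```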

The SG, the FF-SG and the A-SG algorithms are applied to estimate the parameters of the piecewise linear system with colored noise, and the estimation errors \(\delta :=\Vert \hat{\varvec{{\vartheta }}}-\varvec{{\vartheta }}\Vert /\Vert \varvec{{\vartheta }}\Vert \) or \(\delta :=\Vert \bar{\varvec{{\vartheta }}}-\varvec{{\vartheta }}\Vert /\Vert \varvec{{\vartheta }}\Vert \) versus \(\tau \) are shown in Fig. 4.

Table 1 The SG algorithm estimates and errors
Table 2 The FF-SG algorithm estimates and errors
Table 3 The A-SG algorithm estimates and errors
Table 4 The means and variances of the parameter estimation errors
Fig. 4 The SG, FF-SG and A-SG estimation errors \(\delta \) versus \(\tau \) of Example 2

From these two examples, we can draw the following findings.

  1. Tables 1, 2 and 3 show that the FF-SG algorithm and the A-SG algorithm perform better than the SG algorithm.

  2. Figures 3 and 4 show that the estimation error curve of the FF-SG algorithm oscillates seriously as the errors converge to zero, while the estimation error curve of the A-SG algorithm is relatively smooth.

  3. Table 4 shows that the A-SG algorithm is the most effective of these three algorithms.

  4. The algorithms proposed in this paper can identify not only the Hammerstein system with white noise, but also the Hammerstein system with colored noise.

6 Conclusions

In this paper, some SG algorithms are proposed for Hammerstein systems with piecewise linearity. The key term separation method is used to transform the nonlinear model into a regression model. In order to accelerate the convergence rate of the SG algorithm, an FF-SG algorithm and an A-SG algorithm are studied. Compared with the FF-SG algorithm, the A-SG algorithm has almost the same estimation error mean but a smaller estimation error variance. Therefore, the A-SG algorithm has broader application prospects in system identification.

The purpose of this paper is to develop two accelerated SG algorithms for nonlinear systems. These methods can be combined with other identification algorithms, e.g., the recursive least squares algorithm and the expectation–maximization algorithm, to study the parameter estimation issues of time-delay systems, switching systems and neural network learning systems [25, 44, 45].