1 Introduction

Parameter estimation algorithms are often obtained by minimizing a criterion function. The gradient search, the least squares search and the Newton search are useful tools for solving nonlinear optimization problems [15, 23, 44–46]. Nonlinearities exist widely in industrial processes [21]. Typical nonlinear systems are block-oriented systems, including input nonlinear systems [25, 30, 38, 42], output nonlinear systems [11, 41] or Wiener nonlinear systems [9], input–output (i.e., Hammerstein–Wiener) nonlinear systems [2, 31] and feedback nonlinear systems [14]. When the static nonlinear part of a block-oriented system can be expressed as a linear combination of known basis functions, the corresponding systems are the Hammerstein systems, the Wiener systems and their combinations [16, 40]. A direct method of identifying block-oriented nonlinear systems is the over-parametrization method [3]. By re-parameterizing the nonlinear system, the output becomes linear in the unknown parameters, so that any linear identification algorithm can be applied [4]. However, the resulting identification model contains the cross-products between the parameters of the nonlinear part and those of the linear part, so more parameters must be estimated than the nonlinear system actually contains.

In the area of system identification, linear-in-parameter output error moving average systems are common; for such systems, Wang and Tang [36] presented a recursive least squares estimation algorithm and discussed several gradient-based iterative estimation algorithms using the filtering technique [37], and Wang and Zhu [39] presented a multi-innovation parameter estimation algorithm. A system that contains product terms of parameters is called a bilinear-in-parameter system. Bai and Liu [5] discussed the least squares solution of the normalized iterative method, the over-parametrization method and the numerical method for bilinear-in-parameter systems; Wang et al. [24] revisited the unweighted least squares solution and extended it to the case of colored noise; Abrahamsson et al. [1] presented a two-stage method based on the approximation of a weighting matrix and discussed applications to submarine detection. Other methods include the Kalman filtering-based identification approaches [10, 23].

The convergence of identification algorithms is a basic topic in system identification and attracts much attention. Recently, an auxiliary model-based recursive least squares algorithm and an auxiliary model-based hierarchical gradient algorithm have been proposed for dual-rate state space systems [12] and for multivariable Box–Jenkins systems using data filtering [32–34]. A modeling and multi-innovation parameter identification method has been proposed for Hammerstein nonlinear state space systems using the filtering technique [35]; a recursive parameter and state estimation algorithm has been proposed for an input nonlinear state space system using the hierarchical identification principle [29]; an auxiliary model-based gradient algorithm has been reported for time-delay systems by transforming the input–output representation into a regression model, and its convergence was studied [13]. The convergence of the hierarchical least squares algorithm has been analyzed for bilinear-in-parameter systems [26]. On the basis of the work in [26], this paper derives a hierarchical stochastic gradient (HSG) algorithm for bilinear-in-parameter systems based on the decomposition idea and analyzes its performance.

The rest of this paper is organized as follows. Section 2 presents an HSG algorithm for bilinear-in-parameter systems. Section 3 analyzes the performance of the HSG algorithm. Section 4 provides an illustrative example to show that the proposed algorithm is effective. Finally, a brief summary of the main contents is given in Sect. 5.

2 System Description and the HSG Algorithm

Consider the following bilinear-in-parameter system [5, 26],

$$\begin{aligned} y(t)=\varvec{a}^{\tiny \text{ T }}\varvec{F}(t)\varvec{b}+v(t), \end{aligned}$$
(1)

where y(t) is the system output, \(\varvec{F}(t) \in {\mathbb {R}}^{m\times n}\) is composed of available measurement data, v(t) is a white noise sequence with zero mean and finite variance \(\sigma ^2\), and \(\varvec{a}=[a_1, a_2,\ldots , a_m]^{\tiny \text{ T }}\in {\mathbb {R}}^m\) and \(\varvec{b}=[b_1, b_2, \ldots , b_n]^{\tiny \text{ T }}\in {\mathbb {R}}^n\) are the unknown parameter vectors to be estimated.

For the identification model in (1), assume that m and n are known, and that \(y(t)=0\), \(v(t)=0\) for \(t\leqslant 0\). Note that for any constant \(\lambda \ne 0\), the pair \(\lambda \varvec{a}\), \(\varvec{b}/\lambda \) gives the same input–output relationship in (1), so the constant \(\lambda \) has to be fixed. Without loss of generality, we adopt the following assumption.

Assumption 1

\(\lambda =\Vert \varvec{b}\Vert \), and the first element of \(\varvec{b}\) is positive, i.e., \(b_1>0\), where the norm of the vector \(\varvec{X}\) is defined by \(\Vert \varvec{X}\Vert ^2:=\mathrm{tr}[\varvec{X}\varvec{X}^{\tiny \text{ T }}]\).
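As an aside on Assumption 1, the scaling ambiguity can be removed from any estimate pair by a simple rescaling that leaves the model output unchanged. A minimal Python sketch is given below; the helper `normalize_pair` and its argument names are illustrative and are not part of the algorithm presented later.

```python
import numpy as np

def normalize_pair(a_hat, b_hat):
    """Rescale (a_hat, b_hat) so that ||b_hat|| = 1 and b_hat[0] > 0.

    The rescaling (lam * a_hat, b_hat / lam) leaves the product
    a_hat^T F(t) b_hat unchanged, so the model output is not affected.
    """
    lam = np.linalg.norm(b_hat)          # lambda = ||b||
    a_hat, b_hat = lam * a_hat, b_hat / lam
    if b_hat[0] < 0:                      # flip both signs so that b_1 > 0
        a_hat, b_hat = -a_hat, -b_hat
    return a_hat, b_hat
```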

Define the vectors \({\varvec{\psi }}(t):=\varvec{F}(t)\varvec{b}\in {\mathbb {R}}^m\) and \({\varvec{\varphi }}(t):=\varvec{F}^{\tiny \text{ T }}(t)\varvec{a}\in {\mathbb {R}}^n\). Then Eq. (1) can be written as

$$\begin{aligned} y(t)={\varvec{\psi }}^{\tiny \text{ T }}(t)\varvec{a}+v(t), \end{aligned}$$
(2)

or

$$\begin{aligned} y(t)={\varvec{\varphi }}^{\tiny \text{ T }}(t)\varvec{b}+v(t). \end{aligned}$$
(3)

Define the following two cost functions:

$$\begin{aligned} J_1(\varvec{a}):= & {} \Vert y(t)-{\varvec{\psi }}^{\tiny \text{ T }}(t)\varvec{a}\Vert ^2, \\ J_2(\varvec{b}):= & {} \Vert y(t)-{\varvec{\varphi }}^{\tiny \text{ T }}(t)\varvec{b}\Vert ^2. \end{aligned}$$
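Since \(J_1(\varvec{a})\) and \(J_2(\varvec{b})\) are quadratic in \(\varvec{a}\) and \(\varvec{b}\), respectively, their gradients, which the negative gradient search below follows, are

$$\begin{aligned} \frac{\partial J_1(\varvec{a})}{\partial \varvec{a}}=-2{\varvec{\psi }}(t)\left[ y(t)-{\varvec{\psi }}^{\tiny \text{ T }}(t)\varvec{a}\right] ,\quad \frac{\partial J_2(\varvec{b})}{\partial \varvec{b}}=-2{\varvec{\varphi }}(t)\left[ y(t)-{\varvec{\varphi }}^{\tiny \text{ T }}(t)\varvec{b}\right] . \end{aligned}$$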

Using the negative gradient search and minimizing \(J_1(\varvec{a})\) and \(J_2(\varvec{b})\), we obtain the estimates \(\hat{\varvec{a}}(t)\) of \(\varvec{a}\) in Subsystem (2) and \(\hat{\varvec{b}}(t)\) of \(\varvec{b}\) in Subsystem (3) at time t:

$$\begin{aligned} \hat{\varvec{a}}(t)= & {} \hat{\varvec{a}}(t-1)+\frac{{\varvec{\psi }}(t)}{r_1(t)} \left[ y(t)-{\varvec{\psi }}^{\tiny \text{ T }}(t)\hat{\varvec{a}}(t-1)\right] , \end{aligned}$$
(4)
$$\begin{aligned} r_1(t)= & {} r_1(t-1)+\Vert {\varvec{\psi }}(t)\Vert ^2,\ r_1(0)=1, \end{aligned}$$
(5)
$$\begin{aligned} \hat{\varvec{b}}(t)= & {} \hat{\varvec{b}}(t-1)+\frac{{\varvec{\varphi }}(t)}{r_2(t)} \left[ y(t)-{\varvec{\varphi }}^{\tiny \text{ T }}(t)\hat{\varvec{b}}(t-1)\right] , \end{aligned}$$
(6)
$$\begin{aligned} r_2(t)= & {} r_2(t-1)+\Vert {\varvec{\varphi }}(t)\Vert ^2,\ r_2(0)=1. \end{aligned}$$
(7)

Since the vectors \({\varvec{\psi }}(t)\) and \({\varvec{\varphi }}(t)\) contain the unknown parameter vectors \(\varvec{b}\) and \(\varvec{a}\), the algorithm in (4)–(7) cannot be implemented directly. This problem can be solved by replacing \(\varvec{b}\) and \(\varvec{a}\) with their estimates \(\hat{\varvec{b}}(t-1)\) and \(\hat{\varvec{a}}(t-1)\) at time \(t-1\). Letting \(\hat{{\varvec{\psi }}}(t):=\varvec{F}(t)\hat{\varvec{b}}(t-1)\in {\mathbb {R}}^m\) and \(\hat{{\varvec{\varphi }}}(t):=\varvec{F}^{\tiny \text{ T }}(t)\hat{\varvec{a}}(t-1)\in {\mathbb {R}}^n\), we obtain the following HSG algorithm for the bilinear-in-parameter system in (1):

$$\begin{aligned} \hat{\varvec{a}}(t)= & {} \hat{\varvec{a}}(t-1)+\frac{\varvec{F}(t)\hat{\varvec{b}}(t-1)}{r_1(t)} \left[ y(t)-\hat{\varvec{a}}^{\tiny \text{ T }}(t-1)\varvec{F}(t)\hat{\varvec{b}}(t-1)\right] , \end{aligned}$$
(8)
$$\begin{aligned} r_1(t)= & {} r_1(t-1)+\Vert \varvec{F}(t)\hat{\varvec{b}}(t-1)\Vert ^2,\ r_1(0)=1, \end{aligned}$$
(9)
$$\begin{aligned} \hat{\varvec{b}}(t)= & {} \hat{\varvec{b}}(t-1)+\frac{\varvec{F}^{\tiny \text{ T }}(t)\hat{\varvec{a}}(t-1)}{r_2(t)} \left[ y(t)-\hat{\varvec{a}}^{\tiny \text{ T }}(t-1)\varvec{F}(t)\hat{\varvec{b}}(t-1)\right] , \end{aligned}$$
(10)
$$\begin{aligned} r_2(t)= & {} r_2(t-1)+\Vert \varvec{F}^{\tiny \text{ T }}(t)\hat{\varvec{a}}(t-1)\Vert ^2,\ r_2(0)=1. \end{aligned}$$
(11)

The initial values are taken to be \(\hat{\varvec{a}}(0)=\mathbf{1}_m/p_0\), \(\hat{\varvec{b}}(0)=\mathbf{1}_n/p_0\), where \(p_0\) is a large number, e.g., \(p_0=10^6\).
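A minimal Python sketch of the recursion (8)–(11) may clarify the computation order; the function name `hsg` and the data containers `F_seq`, `y_seq` are illustrative assumptions and not part of the original presentation.

```python
import numpy as np

def hsg(F_seq, y_seq, m, n, p0=1e6):
    """Hierarchical stochastic gradient (HSG) estimates of a and b, Eqs. (8)-(11).

    F_seq : sequence of (m x n) data matrices F(t), t = 1, 2, ...
    y_seq : sequence of scalar outputs y(t).
    Returns the estimates a_hat(t), b_hat(t) after the last sample.
    """
    a_hat = np.ones(m) / p0              # a_hat(0) = 1_m / p0
    b_hat = np.ones(n) / p0              # b_hat(0) = 1_n / p0
    r1 = r2 = 1.0                        # r1(0) = r2(0) = 1
    for F, y in zip(F_seq, y_seq):
        psi = F @ b_hat                  # psi_hat(t) = F(t) b_hat(t-1)
        phi = F.T @ a_hat                # phi_hat(t) = F(t)^T a_hat(t-1)
        e = y - a_hat @ F @ b_hat        # innovation, uses a_hat(t-1), b_hat(t-1)
        r1 += psi @ psi                  # Eq. (9)
        a_hat = a_hat + psi / r1 * e     # Eq. (8)
        r2 += phi @ phi                  # Eq. (11)
        b_hat = b_hat + phi / r2 * e     # Eq. (10)
    return a_hat, b_hat
```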

3 The Convergence Analysis

Lemma 1

[8] Assume that the nonnegative sequences T(t), \(\eta (t)\) and \(\zeta (t)\) satisfy the inequality

$$\begin{aligned} T(t)\leqslant T(t-1)+\eta (t)-\zeta (t) \end{aligned}$$

and \(\sum \limits _{t=1}^{\infty }\eta (t)<\infty \). Then \(\sum \limits _{t=1}^{\infty }\zeta (t)<\infty \) and T(t) is bounded.

The proof of Lemma 1 is straightforward: summing the inequality from \(s=1\) to t gives

$$\begin{aligned} T(t)+\sum _{s=1}^{t}\zeta (s)\leqslant T(0)+\sum _{s=1}^{t}\eta (s)\leqslant T(0)+\sum _{s=1}^{\infty }\eta (s)<\infty , \end{aligned}$$

and the nonnegativity of T(t) and \(\zeta (t)\) yields the boundedness of T(t) and the convergence of \(\sum \limits _{t=1}^{\infty }\zeta (t)\).

Theorem 1

For the system in (1) and the HSG algorithm in (8)–(11), assume that v(t) is a white noise sequence with zero mean and variance \(\sigma ^2\), and that there exist an integer N and two positive constants \(c_1\) and \(c_2\) such that the following persistent excitation conditions hold:

$$\begin{aligned}&({\mathrm{A1}}) \quad \sum _{j=0}^{N-1}\frac{\hat{{\varvec{\psi }}}(t+j)\hat{{\varvec{\psi }}}^{\tiny \text{ T }}(t+j)}{r_1(t+j)}\geqslant c_1\varvec{I}_m,\ \mathrm{a.s.},\\&({\mathrm{A2}}) \quad \sum _{j=0}^{N-1}\frac{\hat{{\varvec{\varphi }}}(t+j)\hat{{\varvec{\varphi }}}^{\tiny \text{ T }}(t+j)}{r_2(t+j)}\geqslant c_2\varvec{I}_n,\ \mathrm{a.s.} \end{aligned}$$

Then the parameter estimation errors converge to zero, i.e.,

$$\begin{aligned} \Vert \hat{\varvec{a}}(t)-\varvec{a}\Vert \rightarrow 0,\quad \Vert \hat{\varvec{b}}(t)-\varvec{b}\Vert \rightarrow 0. \end{aligned}$$

Proof

Define two parameter error vectors:

$$\begin{aligned} \tilde{\varvec{a}}(t):= & {} \hat{\varvec{a}}(t)-\varvec{a}\in {\mathbb {R}}^m, \end{aligned}$$
(12)
$$\begin{aligned} \tilde{\varvec{b}}(t):= & {} \hat{\varvec{b}}(t)-\varvec{b}\in {\mathbb {R}}^n. \end{aligned}$$
(13)

Substituting (1) and (8) into (12), we have

$$\begin{aligned} \tilde{\varvec{a}}(t)= & {} \tilde{\varvec{a}}(t-1)+\frac{\hat{{\varvec{\psi }}}(t)}{r_1(t)} \left[ y(t)-\hat{\varvec{a}}^{\tiny \text{ T }}(t-1)\varvec{F}(t)\hat{\varvec{b}}(t-1)\right] \nonumber \\= & {} \tilde{\varvec{a}}(t-1)+\frac{\hat{{\varvec{\psi }}}(t)}{r_1(t)} \left[ \varvec{a}^{\tiny \text{ T }}\varvec{F}(t)\varvec{b}-\hat{\varvec{a}}^{\tiny \text{ T }}(t-1)\varvec{F}(t)\hat{\varvec{b}}(t-1)+v(t)\right] \end{aligned}$$
(14)
$$\begin{aligned}= & {} \tilde{\varvec{a}}(t-1)+\frac{\hat{{\varvec{\psi }}}(t)}{r_1(t)} \left[ -\tilde{\varvec{a}}^{\tiny \text{ T }}(t-1)\varvec{F}(t)\hat{\varvec{b}}(t-1)-\varvec{a}^{\tiny \text{ T }}\varvec{F}(t)\tilde{\varvec{b}}(t-1)+v(t)\right] \nonumber \\=: & {} \tilde{\varvec{a}}(t-1)+\frac{\hat{{\varvec{\psi }}}(t)}{r_1(t)}\left[ -\tilde{y}_1(t)-\xi _1(t)+v(t) \right] ,\ \end{aligned}$$
(15)

where

$$\begin{aligned} \tilde{y}_1(t):= & {} \tilde{\varvec{a}}^{\tiny \text{ T }}(t-1)\varvec{F}(t)\hat{\varvec{b}}(t-1)\in {\mathbb {R}}, \end{aligned}$$
(16)
$$\begin{aligned} \xi _1(t):= & {} \varvec{a}^{\tiny \text{ T }}\varvec{F}(t)\tilde{\varvec{b}}(t-1)\in {\mathbb {R}}. \end{aligned}$$
(17)

Taking the squared norm of both sides of (15) and using (16) yields

$$\begin{aligned} \left\| \tilde{\varvec{a}}(t)\right\| ^2= & {} \left\| \tilde{\varvec{a}}(t-1) +\,\frac{\hat{{\varvec{\psi }}}(t)}{r_1(t)}\left[ -\tilde{y}_1(t)-\xi _1(t)+v(t) \right] \right\| ^2 \nonumber \\= & {} \left\| \tilde{\varvec{a}}(t-1)\right\| ^2+\frac{2\tilde{\varvec{a}}^{\tiny \text{ T }}(t-1) \hat{{\varvec{\psi }}}(t)}{r_1(t)} \left[ -\tilde{y}_1(t)-\xi _1(t)+v(t) \right] \nonumber \\&+\,\frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1^2(t)} \left[ -\tilde{y}_1(t)-\xi _1(t)+v(t) \right] ^2 \nonumber \\= & {} \left\| \tilde{\varvec{a}}(t-1)\right\| ^2+\frac{2\tilde{y}_1(t)}{r_1(t)} \left[ -\tilde{y}_1(t)-\xi _1(t)+v(t) \right] \nonumber \\&+\,\frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1^2(t)}\left[ -\tilde{y}_1(t) -\xi _1(t)+v(t) \right] ^2. \end{aligned}$$
(18)

Define \(\tilde{y}_2(t):=\hat{\varvec{a}}^{\tiny \text{ T }}(t-1)\varvec{F}(t)\tilde{\varvec{b}}(t-1)\in {\mathbb {R}}\), \(\xi _2(t):=\tilde{\varvec{a}}^{\tiny \text{ T }}(t-1)\varvec{F}(t)\varvec{b}\in {\mathbb {R}}\). Similarly, we have

$$\begin{aligned} \tilde{\varvec{b}}(t)= & {} \tilde{\varvec{b}}(t-1)+\frac{\hat{{\varvec{\varphi }}}(t)}{r_2(t)}\left[ -\tilde{y}_2(t)-\xi _2(t)+v(t) \right] , \nonumber \\ \left\| \tilde{\varvec{b}}(t)\right\| ^2= & {} \left\| \tilde{\varvec{b}}(t-1)\right\| ^2+\frac{2\tilde{y}_2(t)}{r_2(t)}\left[ -\tilde{y}_2(t)-\xi _2(t)+v(t) \right] \nonumber \\&+\,\frac{\left\| \hat{{\varvec{\varphi }}}(t)\right\| ^2}{r_2^2(t)}\left[ -\tilde{y}_2(t)-\xi _2(t)+v(t) \right] ^2. \end{aligned}$$
(19)

Let \(T(t):=\Vert \tilde{\varvec{a}}(t)\Vert ^2+\Vert \tilde{\varvec{b}}(t)\Vert ^2\). Using (18), (19), (9) and (11) gives

$$\begin{aligned} T(t)= & {} \left\| \tilde{\varvec{a}}(t-1)\right\| ^2+\frac{2\tilde{y}_1(t)}{r_1(t)}\left[ -\tilde{y}_1(t)-\xi _1(t)+v(t) \right] \\&+\frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1^2(t)}\left[ \tilde{y}_1^2(t)+\xi _1^2(t)+v^2(t) +2\tilde{y}_1(t)\xi _1(t)-2\tilde{y}_1(t)v(t)-2\xi _1(t)v(t)\right] \\&+\,\left\| \tilde{\varvec{b}}(t-1)\right\| ^2+\frac{2\tilde{y}_2(t)}{r_2(t)} \left[ -\tilde{y}_2(t)-\xi _2(t)+v(t) \right] \\&+\frac{\left\| \hat{{\varvec{\varphi }}}(t)\right\| ^2}{r_2^2(t)} \left[ \tilde{y}_2^2(t)+\xi _2^2(t)+v^2(t) +\,2\tilde{y}_2(t)\xi _2(t)-2\tilde{y}_2(t)v(t)-2\xi _2(t)v(t)\right] \\= & {} T(t-1)-\left[ \frac{2}{r_1(t)}-\frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1^2(t)}\right] \tilde{y}_1^2(t) \\&+\,2\left[ \frac{1}{r_1(t)}-\frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1^2(t)}\right] \tilde{y}_1(t)\left[ v(t)-\xi _1(t)\right] \\&+\frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1^2(t)}\left[ \xi _1^2(t)+v^2(t)-2\xi _1(t)v(t)\right] -\left[ \frac{2}{r_2(t)}-\frac{\left\| \hat{{\varvec{\varphi }}}(t)\right\| ^2}{r_2^2(t)}\right] \tilde{y}_2^2(t)\\&+2\left[ \frac{1}{r_2(t)} -\frac{\left\| \hat{{\varvec{\varphi }}}(t)\right\| ^2}{r_2^2(t)}\right] \tilde{y}_2(t) \left[ v(t)-\xi _2(t)\right] +\frac{\left\| \hat{{\varvec{\varphi }}}(t)\right\| ^2}{r_2^2(t)} \left[ \xi _2^2(t)+v^2(t)-2\xi _2(t)v(t)\right] \\= & {} T(t-1)-\left[ \frac{r_1(t)+r_1(t-1)}{r_1^2(t)}\right] \tilde{y}_1^2(t) +\frac{2r_1(t-1)}{r_1^2(t)}\tilde{y}_1(t)\left[ v(t)-\xi _1(t)\right] \\&+\frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1^2(t)}\left[ \xi _1^2(t)+v^2(t)-2\xi _1(t)v(t)\right] -\left[ \frac{r_2(t)+r_2(t-1)}{r_2^2(t)}\right] \tilde{y}_2^2(t)\\&+\frac{2r_2(t-1)}{r_2^2(t)}\tilde{y}_2(t)\left[ v(t)-\xi _2(t)\right] +\frac{\left\| \hat{{\varvec{\varphi }}}(t)\right\| ^2}{r_2^2(t)}\left[ \xi _2^2(t)+v^2(t)-2\xi _2(t)v(t)\right] \\\leqslant & {} T(t-1)-\frac{1}{r_1(t)}\tilde{y}_1^2(t)+\frac{2r_1(t-1)}{r_1^2(t)}\tilde{y}_1(t)\left[ v(t)-\xi _1(t)\right] \\&+\frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1^2(t)} \left[ \xi _1^2(t)+v^2(t)-2\xi _1(t)v(t)\right] -\frac{1}{r_2(t)}\tilde{y}_2^2(t)\\&+\frac{2r_2(t-1)}{r_2^2(t)}\tilde{y}_2(t)\left[ v(t)-\xi _2(t)\right] +\frac{\left\| \hat{{\varvec{\varphi }}}(t)\right\| ^2}{r_2^2(t)} \left[ \xi _2^2(t)+v^2(t)-2\xi _2(t)v(t)\right] \end{aligned}$$
$$\begin{aligned}= & {} T(t-1)-\gamma (t)-\frac{1}{r_1(t)}\tilde{y}_1^2(t) +\frac{2r_1(t-1)}{r_1^2(t)}\tilde{y}_1(t)v(t)\nonumber \\&+\frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1^2(t)}\left[ \xi _1^2(t)+v^2(t)\right] \nonumber \\&-\frac{1}{r_2(t)}\tilde{y}_2^2(t)+\frac{2r_2(t-1)}{r_2^2(t)}\tilde{y}_2(t)v(t) \nonumber \\&+\frac{\left\| \hat{{\varvec{\varphi }}}(t)\right\| ^2}{r_2^2(t)} \left[ \xi _2^2(t)+v^2(t)-2\xi _2(t)v(t)\right] , \end{aligned}$$
(20)

where

$$\begin{aligned} \gamma (t):=\frac{2r_1(t-1)}{r_1^2(t)}\tilde{y}_1(t)\xi _1(t) +\frac{2r_2(t-1)}{r_2^2(t)}\tilde{y}_2(t)\xi _2(t). \end{aligned}$$

When \(\xi _1^2(t)>\varepsilon \) or \(\xi _2^2(t)>\varepsilon \) or \(\gamma (t)<0\) (\(\varepsilon \) is a given positive number), we let \(\tilde{\varvec{a}}(t):=\tilde{\varvec{a}}(t-1)\) and \(\tilde{\varvec{b}}(t):=\tilde{\varvec{b}}(t-1)\), and thus we have \(T(t)=T(t-1)\). When \(\xi _1^2(t) \leqslant \varepsilon \), \(\xi _2^2(t)\leqslant \varepsilon \) and \(\gamma (t)\geqslant 0\), since v(t) is a white noise with zero mean and variance \(\sigma ^2\), and \(\varvec{F}(t)\), \(\hat{\varvec{a}}(t-1)\), \(\hat{\varvec{b}}(t-1)\), \(r_1(t)\), \(r_2(t)\), \(\xi _1(t)\) and \(\xi _2(t)\) are independent of v(t), taking the expectation of both sides of (20), we have

$$\begin{aligned} \mathrm{E}[T(t)]\leqslant & {} \mathrm{E}[T(t-1)]-\mathrm{E}\left[ \frac{\tilde{y}_1^2(t)}{r_1(t)}+\frac{\tilde{y}_2^2(t)}{r_2(t)}\right] \nonumber \\&+\mathrm{E}\left[ \frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1^2(t)}+\frac{\left\| \hat{{\varvec{\varphi }}}(t)\right\| ^2}{r_2^2(t)}\right] (\sigma ^2+\varepsilon ), \end{aligned}$$
(21)

From (9), we have

$$\begin{aligned} \sum _{t=1}^{\infty }\frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1^2(t)}\leqslant & {} \sum _{t=1}^{\infty }\frac{\left\| \hat{{\varvec{\psi }}}(t)\right\| ^2}{r_1(t)r_1(t-1)} =\sum _{t=1}^{\infty }\frac{r_1(t)-r_1(t-1)}{r_1(t)r_1(t-1)}\\= & {} \sum _{t=1}^{\infty }\left[ \frac{1}{r_1(t-1)}-\frac{1}{r_1(t)}\right] =\frac{1}{r_1(0)}-\frac{1}{r_1(\infty )}<\infty ,\ \mathrm {a.s.} \end{aligned}$$

Similarly, from (11), we have

$$\begin{aligned} \sum _{t=1}^{\infty }\frac{\left\| \hat{{\varvec{\varphi }}}(t)\right\| ^2}{r_2^2(t)}<\infty ,\ \mathrm{a.s.}\end{aligned}$$

Hence, the sum of the last term on the right-hand side of (21) from \(t=1\) to \(\infty \) is finite. Applying Lemma 1 to (21), we conclude that \(\mathrm{E}[T(t)]\) converges to a constant, so there exist a constant \(C>0\) and a time \(t_0\) such that \(\mathrm{E}[T(t)]\leqslant C\) for \(t>t_0\). From (21), it follows that

$$\begin{aligned} \sum _{t=1}^{\infty }\left[ \frac{\tilde{y}_1^2(t)}{r_1(t)}+\frac{\tilde{y}_2^2(t)}{r_2(t)}\right] <\infty . \end{aligned}$$

Since \(r_1(t)>0\) and \(r_2(t)>0\), we have

$$\begin{aligned} \sum _{t=1}^{\infty }\frac{\tilde{y}_1^2(t)}{r_1(t)}<\infty , \quad \sum _{t=1}^{\infty }\frac{\tilde{y}_2^2(t)}{r_2(t)}<\infty , \quad \lim _{t\rightarrow \infty }\frac{\tilde{y}_1^2(t)}{r_1(t)}=0, \quad \lim _{t\rightarrow \infty }\frac{\tilde{y}_2^2(t)}{r_2(t)}=0. \end{aligned}$$
(22)

Define the identification innovation

$$\begin{aligned} e(t):=y(t)-\hat{\varvec{a}}^{\tiny \text{ T }}(t-1)\varvec{F}(t)\hat{\varvec{b}}(t-1)\in {\mathbb {R}}, \end{aligned}$$

From (14), we have

$$\begin{aligned} \tilde{\varvec{a}}(t)=\tilde{\varvec{a}}(t-1)+\frac{\hat{{\varvec{\psi }}}(t)}{r_1(t)}e(t). \end{aligned}$$
(23)

Replacing t in (23) with \(t+j\) and successive substitutions give

$$\begin{aligned} \tilde{\varvec{a}}(t+j) =\tilde{\varvec{a}}(t)+\sum _{i=1}^{j} \frac{\hat{{\varvec{\psi }}}(t+i)}{r_1(t+i)}e(t+i). \end{aligned}$$
(24)

Using (16), it follows that

$$\begin{aligned} \tilde{y}_1(t)= & {} \hat{{\varvec{\psi }}}^{\tiny \text{ T }}(t)\tilde{\varvec{a}}(t-1), \nonumber \\ \tilde{y}_1(t+j)= & {} \hat{{\varvec{\psi }}}^{\tiny \text{ T }}(t+j)\tilde{\varvec{a}}(t+j-1). \end{aligned}$$
(25)

Substituting (24) into (25) gives

$$\begin{aligned} \hat{{\varvec{\psi }}}^{\tiny \text{ T }}(t+j)\tilde{\varvec{a}}(t)=\tilde{y}_1(t+j)-\hat{{\varvec{\psi }}}^{\tiny \text{ T }}(t+j)\sum _{i=1}^{j-1}\frac{\hat{{\varvec{\psi }}}(t+i)}{r_1(t+i)}e(t+i), \end{aligned}$$
(26)

Squaring (26), dividing by \(r_1(t+j)\), summing for \(j\) from \(j=1\) to \(j=N-1\), and using (A1) and (24), we have

$$\begin{aligned} c_1\Vert \tilde{\varvec{a}}(t)\Vert ^2\leqslant & {} \tilde{\varvec{a}}^{\tiny \text{ T }}(t)\left[ \sum _{j=1}^{N-1}\frac{\hat{{\varvec{\psi }}}(t+j)\hat{{\varvec{\psi }}}^{\tiny \text{ T }}(t+j)}{r_1(t+j)}\right] \tilde{\varvec{a}}(t)\nonumber \\= & {} \sum _{j=1}^{N-1}\frac{\tilde{\varvec{a}}^{\tiny \text{ T }}(t)\hat{{\varvec{\psi }}}(t+j)\hat{{\varvec{\psi }}}^{\tiny \text{ T }}(t+j)\tilde{\varvec{a}}(t)}{r_1(t+j)}\nonumber \\\leqslant & {} \sum _{j=1}^{N-1}\left[ \frac{2\tilde{y}_1^2(t+j)}{r_1(t+j)} +\frac{2\left\| \hat{{\varvec{\psi }}}(t+j)\right\| ^2}{r_1(t+j)}\left\| \sum _{i=1}^{j-1}\frac{\hat{{\varvec{\psi }}}(t+i)}{r_1(t+i)}e(t+i)\right\| ^2\right] \nonumber \\= & {} \sum _{j=1}^{N-1}\left[ \frac{2\tilde{y}_1^2(t+j)}{r_1(t+j)}+\frac{2\left\| \hat{{\varvec{\psi }}}(t+j)\right\| ^2}{r_1(t+j)}\left\| \tilde{\varvec{a}}(t+j-1)-\tilde{\varvec{a}}(t)\right\| ^2\right] \nonumber \\\leqslant & {} \sum _{j=1}^{N-1}\left[ \frac{2\tilde{y}_1^2(t+j)}{r_1(t+j)} +\frac{4\left\| \hat{{\varvec{\psi }}}(t+j)\right\| ^2}{r_1(t+j)}\left( \left\| \tilde{\varvec{a}}(t+j-1)\right\| ^2+\left\| \tilde{\varvec{a}}(t)\right\| ^2\right) \right] ,\nonumber \\ \end{aligned}$$
(27)

Since \(\mathrm{E}[T(t)]=\mathrm{E}[\Vert \tilde{\varvec{a}}(t)\Vert ^2+\Vert \tilde{\varvec{b}}(t)\Vert ^2]\leqslant C\), we have \(\mathrm{E}[\Vert \tilde{\varvec{a}}(t)\Vert ^2]\leqslant C \). Taking the expectation and the limit of both sides of (27), it follows that

$$\begin{aligned} \lim _{t\rightarrow \infty }\mathrm {E}\left[ \left\| \tilde{\varvec{a}}(t)\right\| ^2\right] \leqslant \lim _{t\rightarrow \infty }\frac{1}{c_1}\mathrm {E}\left\{ \sum _{j=1}^{N-1}\left[ \frac{2\tilde{y}_1^2(t+j)}{r_1(t+j)} +\frac{8C\left\| \hat{{\varvec{\psi }}}(t+j)\right\| ^2}{r_1(t+j)}\right] \right\} . \end{aligned}$$

Assume that \(\lim \limits _{t\rightarrow \infty }\Vert \hat{{\varvec{\psi }}}(t+j)\Vert ^2/r_1(t+j)=0\). Using (22) gives \(\lim \limits _{t\rightarrow \infty }\mathrm{E}\left[ \Vert \tilde{\varvec{a}}(t)\Vert ^2\right] =0\). Similarly, we can obtain \(\lim \limits _{t\rightarrow \infty }\mathrm{E}[\Vert \tilde{\varvec{b}}(t)\Vert ^2]=0\). This completes the proof. \(\square \)

In order to improve the convergence rate of the HSG algorithm, we introduce a forgetting factor \(\lambda \) \((0\leqslant \lambda \leqslant 1)\) into (8)–(11); the corresponding algorithm is called the forgetting factor HSG (FF-HSG) algorithm, which is as follows:

$$\begin{aligned} \hat{\varvec{a}}(t)= & {} \hat{\varvec{a}}(t-1)+\frac{\varvec{F}(t)\hat{\varvec{b}}(t-1)}{r_1(t)} \left[ y(t)-\hat{\varvec{a}}^{\tiny \text{ T }}(t-1)\varvec{F}(t)\hat{\varvec{b}}(t-1)\right] , \end{aligned}$$
(28)
$$\begin{aligned} r_1(t)= & {} \lambda r_1(t-1)+\left\| \varvec{F}(t)\hat{\varvec{b}}(t-1)\right\| ^2,\ r_1(0)=1, \end{aligned}$$
(29)
$$\begin{aligned} \hat{\varvec{b}}(t)= & {} \hat{\varvec{b}}(t-1)+\frac{\varvec{F}^{\tiny \text{ T }}(t)\hat{\varvec{a}}(t-1)}{r_2(t)} \left[ y(t)-\hat{\varvec{a}}^{\tiny \text{ T }}(t-1)\varvec{F}(t)\hat{\varvec{b}}(t-1)\right] , \end{aligned}$$
(30)
$$\begin{aligned} r_2(t)= & {} \lambda r_2(t-1)+\left\| \varvec{F}^{\tiny \text{ T }}(t)\hat{\varvec{a}}(t-1)\right\| ^2,\ r_2(0)=1. \end{aligned}$$
(31)

Obviously, when the forgetting factor \(\lambda =1\), the FF-HSG algorithm reduces to the HSG algorithm; when \(\lambda =0\), the FF-HSG algorithm degenerates into the hierarchical projection algorithm.
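For illustration, the FF-HSG recursion (28)–(31) changes only the normalization updates of the Python sketch of the HSG algorithm in Sect. 2; a minimal sketch (again with illustrative names) is given below.

```python
import numpy as np

def ff_hsg(F_seq, y_seq, m, n, lam=0.98, p0=1e6):
    """Forgetting factor HSG (FF-HSG), Eqs. (28)-(31); lam = 1 recovers the HSG algorithm."""
    a_hat, b_hat = np.ones(m) / p0, np.ones(n) / p0
    r1 = r2 = 1.0
    for F, y in zip(F_seq, y_seq):
        psi, phi = F @ b_hat, F.T @ a_hat
        e = y - a_hat @ F @ b_hat        # innovation, uses a_hat(t-1), b_hat(t-1)
        r1 = lam * r1 + psi @ psi        # Eq. (29)
        a_hat = a_hat + psi / r1 * e     # Eq. (28)
        r2 = lam * r2 + phi @ phi        # Eq. (31)
        b_hat = b_hat + phi / r2 * e     # Eq. (30)
    return a_hat, b_hat
```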

4 Example

Consider the following bilinear-in-parameter system with \(m=2\) and \(n=3\),

$$\begin{aligned} y(t)= & {} \varvec{a}^{\tiny \text{ T }}\varvec{F}(t)\varvec{b}+v(t), \\ \varvec{F}(t)= & {} \left[ \begin{array}{c} \varvec{f}(u(t-1)) \\ \varvec{f}(u(t-2)) \end{array} \right] = \left[ \begin{array}{ccc} u(t-1) &{} u^2(t-1) &{} u^3(t-1)\\ u(t-2) &{} u^2(t-2) &{} u^3(t-2)\end{array}\right] , \\ \varvec{a}= & {} [2.06,1.00]^{\tiny \text{ T }},\quad \varvec{b}=[0.70,\sqrt{0.02},0.70]^{\tiny \text{ T }}, \\ {\varvec{\theta }}= & {} [\varvec{a}^{\tiny \text{ T }},\varvec{b}^{\tiny \text{ T }}]^{\tiny \text{ T }}=[2.06,1.00,0.70,\sqrt{0.02},0.70]^{\tiny \text{ T }}, \end{aligned}$$

where \(\Vert \varvec{b}\Vert =1\). In the simulation, the input u(t) is taken as a persistently exciting sequence with zero mean and unit variance, and v(t) is taken as an uncorrelated noise sequence with zero mean and variance \(\sigma ^2=0.10^2\). Taking the data length \(L=3000\) and applying the HSG and FF-HSG algorithms to the input–output data \(\{y(t),\varvec{F}(t):\ t=1,2,3,\ldots \}\) to generate the parameter estimates \(\hat{\varvec{a}}(t)\) and \(\hat{\varvec{b}}(t)\), the parameter estimates and their estimation errors are given in Tables 1, 2 and 3, and the estimation error \(\delta :=\Vert \hat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert /\Vert {\varvec{\theta }}\Vert \) versus t is shown in Fig. 1.
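A minimal script along the lines of this experiment, assuming the `hsg` sketch from Sect. 2 and using `numpy` only, could look as follows; it illustrates the setup and is not the code that produced Tables 1, 2 and 3.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true = np.array([2.06, 1.00])
b_true = np.array([0.70, np.sqrt(0.02), 0.70])    # ||b|| = 1
theta = np.concatenate([a_true, b_true])

L, sigma = 3000, 0.10
u = rng.standard_normal(L + 2)                    # zero-mean, unit-variance input
F_seq, y_seq = [], []
for t in range(2, L + 2):
    f1 = np.array([u[t-1], u[t-1]**2, u[t-1]**3])
    f2 = np.array([u[t-2], u[t-2]**2, u[t-2]**3])
    F = np.vstack([f1, f2])                       # F(t) in R^{2x3}
    y = a_true @ F @ b_true + sigma * rng.standard_normal()
    F_seq.append(F)
    y_seq.append(y)

a_hat, b_hat = hsg(F_seq, y_seq, m=2, n=3)        # HSG sketch from Sect. 2
theta_hat = np.concatenate([a_hat, b_hat])
delta = np.linalg.norm(theta_hat - theta) / np.linalg.norm(theta)
print("relative estimation error delta =", delta)
```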

Table 1 HSG parameter estimates and errors
Table 2 FF-HSG parameter estimates and errors (\(\lambda =0.98\))
Table 3 FF-HSG parameter estimates and errors (\(\lambda =0.95\))
Fig. 1 HSG estimation errors \(\delta \) versus t with different forgetting factors

From Tables 1, 2, 3 and Fig. 1, we can draw the following conclusions.

  1. The estimation errors become smaller as time t increases (see Tables 1, 2 and 3).

  2. The FF-HSG algorithm has a faster convergence rate than the HSG algorithm, and the convergence rate increases for appropriately small forgetting factors (see Fig. 1).

5 Conclusions

This paper investigates the performance of the HSG algorithm for bilinear-in-parameter systems. The theoretical analysis shows that the estimates converge to the true values under the persistent excitation conditions, and the simulation results verify the proposed convergence theorem. The method used in this paper can be extended to analyze the convergence of identification algorithms for linear or nonlinear control systems [7, 19, 20, 43], and can be applied to hybrid switching-impulsive dynamical networks [18], uncertain chaotic delayed nonlinear systems [17] and other fields [6, 27, 28].