4.1 Introduction

There are a number of algorithms for adaptive filters which are derived from the conventional LMS algorithm discussed in the previous chapter. The objective of these alternative LMS-based algorithms is to reduce either the computational complexity or the convergence time. In this chapter, several LMS-based algorithms are presented and analyzed, namely, the quantized-error algorithms [1–11], the frequency-domain (or transform-domain) LMS algorithm [12–14], the normalized LMS algorithm [15], the LMS-Newton algorithm [16, 17], and the affine projection algorithm [18–26]. Several algorithms that are related to the main algorithms presented in this chapter are also briefly discussed.

The quantized-error algorithms reduce the computational complexity of the LMS algorithm by representing the error signal with a short wordlength or by a simple power-of-two number.

The convergence speed in the LMS-Newton algorithm is independent of the eigenvalue spread of the input signal correlation matrix. This improvement is achieved by using an estimate of the inverse of the input signal correlation matrix, leading to a substantial increase in the computational complexity.

The normalized LMS algorithm utilizes a variable convergence factor that minimizes the instantaneous error. Such a convergence factor usually reduces the convergence time but increases the misadjustment.

In the frequency-domain algorithm, a transform is applied to the input signal in order to allow the reduction of the eigenvalue spread of the transformed signal correlation matrix as compared to the eigenvalue spread of the input signal correlation matrix. The LMS algorithm applied to the better conditioned transformed signal achieves faster convergence.

The affine projection algorithm reuses old data, resulting in fast convergence when the input signal is highly correlated, and leads to a family of algorithms that can trade off computational complexity for convergence speed.

4.2 Quantized-Error Algorithms

The computational complexity of the LMS algorithm is mainly due to the multiplications performed in the coefficient updating and in the calculation of the adaptive-filter output. In applications where the adaptive filters are required to operate at high speed, such as echo cancellation and channel equalization, it is important to minimize hardware complexity.

A first step to simplify the LMS algorithm is to apply quantization to the error signal, generating the quantized-error algorithm which updates the filter coefficients according to

$${\bf {w}}(k + 1) = {\bf {w}}(k) + 2\mu Q[e(k)]\bf{x}(k)$$
(4.1)

where Q[ ⋅] represents a quantization operation. The quantization function is discrete valued, bounded, and nondecreasing. The type of quantization identifies the quantized-error algorithm.

If the convergence factor μ is a power-of-two number, the coefficient updating can be implemented with simple multiplications, basically consisting of bit shifts and additions. In a number of applications, such as the echo cancellation in full-duplex data transmission [2] and equalization of channels with binary data [3], the input signal x(k) is a binary signal, i.e., assumes values + 1 and − 1. In this case, the adaptive filter can be implemented without any intricate multiplication.

The quantization of the error actually implies a modification in the objective function that is minimized, denoted by F[e(k)]. In a general gradient-type algorithm coefficient updating is performed by

$$\begin{array}{rcl} {\bf {w}}(k + 1) = {\bf {w}}(k) - \mu \frac{\partial F[e(k)]} {\partial {\bf {w}}(k)} = {\bf {w}}(k) - \mu \frac{\partial F[e(k)]} {\partial e(k)} \frac{\partial e(k)} {\partial {\bf {w}}(k)}& &\end{array}$$
(4.2)

For a linear combiner the above equation can be rewritten as

$${\bf {w}}(k + 1) = {\bf {w}}(k) + \mu \frac{\partial F[e(k)]} {\partial e(k)} \bf{x}(k)$$
(4.3)

Therefore, the objective function that is minimized in the quantized-error algorithms is such that

$$\frac{\partial F[e(k)]} {\partial e(k)} = 2Q[e(k)]$$
(4.4)

where F[e(k)] is obtained by integrating 2Q[e(k)] with respect to e(k). Note that the chain rule applied in (4.3) is not valid at the points of discontinuity of Q[ ⋅] where F[e(k)] is not differentiable [6].

The performances of the quantized-error and LMS algorithms are obviously different. The analyses of some widely used quantized-error algorithms are presented in the following subsections.

4.2.1 Sign-Error Algorithm

The simplest form for the quantization function is the sign (sgn) function defined by

$$\begin{array}{rcl} \mathrm{sgn}[b] = \left \{\begin{array}{rl} 1,&b > 0 \\ 0,&b = 0 \\ - 1,& b < 0\\ \end{array} \right.& &\end{array}$$
(4.5)

The sign-error algorithm utilizes the sign function as the error quantizer, where the coefficient vector updating is performed by

$${\bf {w}}(k + 1) = {\bf {w}}(k) + 2\mu \ \mathrm{sgn}[e(k)]\ \bf{x}(k)$$
(4.6)

Figure 4.1 illustrates the realization of the sign-error algorithm for a delay line input x(k). If μ is a power-of-two number, one iteration of the sign-error algorithm requires N + 1 multiplications for the error generation. The total number of additions is 2N + 2. The detailed description of the sign-error algorithm is shown in Algorithm 4.1. Obviously, the vectors x(0) and w(0) can be initialized in a different way from that described in the algorithm.
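As an illustration, the following is a minimal Python sketch of the sign-error update (4.6) for a delay-line input; the system-identification setup, the signal lengths, and the power-of-two value of μ are hypothetical choices made only to exercise the recursion.

```python
import numpy as np

def sign_error_lms(x, d, num_taps, mu):
    """Sign-error algorithm: w(k+1) = w(k) + 2*mu*sgn[e(k)]*x(k)."""
    w = np.zeros(num_taps)            # w(0)
    x_buf = np.zeros(num_taps)        # delay-line input vector x(k)
    e = np.zeros(len(x))
    for k in range(len(x)):
        x_buf = np.concatenate(([x[k]], x_buf[:-1]))   # shift in the new sample
        e[k] = d[k] - w @ x_buf                        # error generation
        w = w + 2 * mu * np.sign(e[k]) * x_buf         # quantized-error update (4.6)
    return w, e

# Hypothetical usage: identify a short FIR system from a white-noise input.
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
w_o = np.array([0.5, -0.3, 0.1])
d = np.convolve(x, w_o)[:len(x)] + 0.01 * rng.standard_normal(len(x))
w, e = sign_error_lms(x, d, num_taps=3, mu=2**-8)      # power-of-two mu
print(w)                                               # should approach w_o
```

Note that np.sign follows the same convention as (4.5), returning zero when its argument is zero.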

Fig. 4.1
figure 1

Sign-error adaptive FIR filter: Q[e(k)] = sgn[e(k)]

The objective function that is minimized by the sign-error algorithm is the modulus of the error multiplied by two, i.e.,

$$F[e(k)] = 2\vert e(k)\vert $$
(4.7)

Note that the factor two is included only to present the sign-error and LMS algorithms in a unified form. Obviously, in a practical implementation this factor can be merged with the convergence factor μ.

Some of the properties related to the convergence behavior of the sign-error algorithm in a stationary environment are described, following the same procedure used in the previous chapter for the LMS algorithm.

4.2.1.1 Steady-State Behavior of the Coefficient Vector

The sign-error algorithm can be alternatively described by

$$\Delta {\bf {w}}(k + 1) = \Delta {\bf {w}}(k) + 2\mu \ \mathrm{sgn}[e(k)]\ \bf{x}(k)$$
(4.8)

where \(\Delta {\bf {w}}(k) = {\bf {w}}(k) -{{\bf {w}}}_{o}\). The expected value of the coefficient-error vector is then given by

$$E[\Delta {\bf {w}}(k + 1)] = E[\Delta {\bf {w}}(k)] + 2\mu E\{\mathrm{sgn}[e(k)]\ \bf{x}(k)\}$$
(4.9)

A noteworthy characteristic of the sign-error algorithm is the influence of the probability density function of the measurement noise n(k) on its convergence. This is due to the fact that \(E\{\mathrm{sgn}[e(k)]\ \bf{x}(k)\} = E\{\mathrm{sgn}[-\Delta {{\bf {w}}}^{T}(k)\bf{x}(k) + n(k)]\bf{x}(k)\}\), where the result of the sign operation is highly dependent on the probability density function of n(k). In [1], the authors present a convergence analysis of the output MSE, i.e., E[e 2(k)], for different distributions of the additional noise, such as Gaussian, uniform, and binary distributions.

A closer examination of (4.8) indicates that even if the error signal becomes very small, the adaptive-filter coefficients will be continually updated due to the sign function applied to the error signal. Therefore, in a situation where the adaptive filter has a sufficient number of coefficients to model the desired signal, and there is no additional noise, Δw(k) will not converge to zero. In this case, w(k) converges to a balloon centered at w o , when μ is appropriately chosen. The mean absolute value of e(k) also converges to a balloon centered around zero, meaning that | e(k) | remains smaller than the balloon radius r [6].

Recall that the desired signal without measurement noise is denoted as d′(k). If it is considered that d′(k) and the elements of x(k) are zero mean and jointly Gaussian and that the additional noise n(k) is also zero mean, Gaussian, and independent of x(k) and d′(k), the error signal will also be a zero-mean Gaussian signal conditioned on Δw(k). In this case, using the results of the Price theorem described in [29] and in Papoulis [30], the following result is valid

$$E\{\mathrm{sgn}[e(k)]\ \bf{x}(k)\} \approx \sqrt{ \frac{2} {\pi \xi (k)}}E[\bf{x}(k)e(k)]$$
(4.10)

where ξ(k) is the variance of e(k) assuming the error has zero mean. The above approximation is valid for small values of μ. For large μ, e(k) is dependent on Δw(k) and the expected value conditioned on Δw(k) should be used instead [35].

By applying (4.10) in (4.9) and by replacing e(k) by e o (k) − Δw T(k)x(k), it follows that

$$\begin{array}{rcl} E[\Delta {\bf {w}}(k + 1)]& =& \left \{\bf{I} - 2\mu \sqrt{ \frac{2} {\pi \xi (k)}}E[\bf{x}(k){\bf{x}}^{T}(k)]\right \}\ E[\Delta {\bf {w}}(k)] \\ & & +2\mu \sqrt{ \frac{2} {\pi \xi (k)}}\ E[{e}_{o}(k)\bf{x}(k)] \end{array}$$
(4.11)

From the orthogonality principle we know that E[e o (k)x(k)] = 0, so that the last element of the above equation is zero. Therefore,

$$E[\Delta {\bf {w}}(k + 1)] = \left [\bf{I} - 2\mu \sqrt{ \frac{2} {\pi \xi (k)}}\bf{R}\right ]\ E[\Delta {\bf {w}}(k)]$$
(4.12)

Following the same steps used for the analysis of E[Δw(k)] in the traditional LMS algorithm, it can be shown that the coefficients of the adaptive filter implemented with the sign-error algorithm converge in the mean if the convergence factor is chosen in the range

$$0 < \mu < \frac{1} {{\lambda }_{\mathrm{max}}}\sqrt{\frac{\pi \xi (k)} {2}}$$
(4.13)

where λmax is the largest eigenvalue of R. It should be mentioned that in case \(\frac{{\lambda }_{\mathrm{max}}} {{\lambda }_{\mathrm{min}}}\) is large, the convergence speed of the coefficients depends on the value of λmin which is related to the slowest mode in (4.12). This conclusion can be drawn by following the same steps of the convergence analysis of the LMS algorithm, where by applying a transformation to (4.12) we obtain an equation similar to (3.17).

A more practical range for μ, which avoids the need for eigenvalue knowledge, is given by

$$0 < \mu < \frac{1} {\mathrm{tr}[\bf{R}]}\sqrt{\frac{\pi \xi (k)} {2}}$$
(4.14)

Note that the upper bound for the value of μ requires the knowledge of the MSE, i.e., ξ(k).

4.2.1.2 Coefficient-Error-Vector Covariance Matrix

The covariance of the coefficient-error vector defined as

$$\mathrm{cov}[\Delta {\bf {w}}(k)] = E\left [\left ({\bf {w}}(k) -{{\bf {w}}}_{o}\right ){\left ({\bf {w}}(k) -{{\bf {w}}}_{o}\right )}^{T}\right ]$$
(4.15)

is calculated by replacing (4.8) in (4.15) following the same steps used in the LMS algorithm. The resulting difference equation for cov[Δw(k)] is given by

$$\begin{array}{rcl} \mathrm{cov}[\Delta {\bf {w}}(k + 1)]& =& \mathrm{cov}[\Delta {\bf {w}}(k)] + 2\mu E\{\mathrm{sgn}[e(k)]\bf{x}(k)\Delta {{\bf {w}}}^{T}(k)\} \\ & & +2\mu E\{\mathrm{sgn}[e(k)]\Delta {\bf {w}}(k){\bf{x}}^{T}(k)\} + 4{\mu }^{2}\bf{R}\end{array}$$
(4.16)

The first term with expected value operation in the above equation can be expressed as

$$\begin{array}{rcl} E\{\mathrm{sgn}[e(k)]\bf{x}(k)\Delta {{\bf {w}}}^{T}(k)\}& =& E\{\mathrm{sgn}[{e}_{ o}(k)-\Delta {{\bf {w}}}^{T}(k)\bf{x}(k)]\bf{x}(k)\Delta {{\bf {w}}}^{T}(k)\} \\ & =& E\{E[\mathrm{sgn}[{e}_{o}(k)-\!\Delta {{\bf {w}}}^{T}(k)\bf{x}(k)]\bf{x}(k)\vert \Delta {\bf {w}}(k)]\Delta {{\bf {w}}}^{T}(k)\}\\ \end{array}$$

where E[a | Δw(k)] is the expected value of a conditioned on the value of Δw(k). In the first equality, e(k) was replaced by the relation \(d(k) -{{\bf {w}}}^{T}(k)\bf{x}(k) -{{\bf {w}}}_{o}^{T}\bf{x}(k) +{ {\bf {w}}}_{o}^{T}\bf{x}(k) = {e}_{o}(k) - \Delta {{\bf {w}}}^{T}(k)\bf{x}(k)\). In the second equality, the concept of conditioned expected value was applied.

Using the Price theorem and considering that the minimum output error e o (k) is zero-mean and uncorrelated with x(k), the following approximations result

$$\begin{array}{rcl} & & E\{E[\mathrm{sgn}[{e}_{o}(k) - \Delta {{\bf {w}}}^{T}(k)\bf{x}(k)]\bf{x}(k)\vert \Delta {\bf {w}}(k)]\Delta {{\bf {w}}}^{T}(k)\} \\ & & \quad \approx E\left \{\sqrt{ \frac{2} {\pi \xi (k)}}E[{e}_{o}(k)\bf{x}(k) -\bf{x}(k){\bf{x}}^{T}(k)\Delta {\bf {w}}(k)\vert \Delta {\bf {w}}(k)]\Delta {{\bf {w}}}^{T}(k)\right \} \\ & & \quad \approx -E\left \{\sqrt{ \frac{2} {\pi \xi (k)}}\bf{R}\Delta {\bf {w}}(k)\Delta {{\bf {w}}}^{T}(k)\right \} \\ & & \quad = -\sqrt{ \frac{2} {\pi \xi (k)}}\bf{R}\mathrm{cov}[\Delta {\bf {w}}(k)] \end{array}$$
(4.17)

Following similar steps to derive the above equation, the second term with the expected value operation in (4.16) can be approximated as

$$E\{\mathrm{sgn}[e(k)]\Delta {\bf {w}}(k){\bf{x}}^{T}(k)\} \approx -\sqrt{ \frac{2} {\pi \xi (k)}}\mathrm{cov}[\Delta {\bf {w}}(k)]\bf{R}$$
(4.18)

Substituting (4.17) and (4.18) in (4.16), we can calculate the vector v′(k) consisting of the diagonal elements of cov[Δw(k)], using the same steps employed in the LMS case (see (3.26)). The resulting dynamic equation for v′(k) is given by

$${ \bf{v}}^{{\prime}}(k + 1) = \left (\bf{I} - 4\mu \sqrt{ \frac{2} {\pi \xi (k)}}\ {\Lambda }\right )\ {\bf{v}}^{{\prime}}(k) + 4{\mu }^{2}\lambda $$
(4.19)

The value of μ must be chosen in a range that guarantees the convergence of v′(k), which is given by

$$0 < \mu < \frac{1} {2{\lambda }_{\mathrm{max}}}\sqrt{\frac{\pi \xi (k)} {2}}$$
(4.20)

A more severe and practical range for μ is

$$0 < \mu < \frac{1} {2\mathrm{tr}[\bf{R}]}\sqrt{\frac{\pi \xi (k)} {2}}$$
(4.21)

For k → ∞, each element of v′(k) tends to

$${v}_{i}(\infty ) = \mu \sqrt{\frac{\pi \xi (\infty )} {2}}$$
(4.22)

4.2.1.3 Excess Mean-Square Error and Misadjustment

The excess MSE can be expressed as a function of the elements of v′(k) by

$$\Delta \xi (k) = \sum\limits_{i=0}^{N}{\lambda }_{ i}{v}_{i}(k) = {\lambda }^{T}{\bf{v}}^{{\prime}}(k)$$
(4.23)

Substituting (4.22) in (4.23) yields

$$\begin{array}{rcl}{ \xi }_{\mathrm{exc}}& =& \mu \sum\limits_{i=0}^{N}{\lambda }_{ i}\sqrt{\frac{\pi \xi (k)} {2}},k \rightarrow \infty \\ & =& \mu \sum\limits_{i=0}^{N}{\lambda }_{ i}\sqrt{\pi \ \frac{{\xi }_{\mathrm{min } } + {\xi }_{\mathrm{exc } } } {2}} \end{array}$$
(4.24)

since \({\lim }_{k\rightarrow \infty }\xi (k) = {\xi }_{\mathrm{min}} + {\xi }_{\mathrm{exc}}\). Therefore,

$${\xi }_{\mathrm{exc}}^{2} = {\mu }^{2}{\left (\sum\limits_{i=0}^{N}{\lambda }_{ i}\right )}^{2}\left (\frac{\pi {\xi }_{\mathrm{min}}} {2} + \frac{\pi {\xi }_{\mathrm{exc}}} {2} \right )$$
(4.25)

The above equation is quadratic in ξexc and therefore has two solutions, of which only the positive one is valid. The meaningful solution for ξexc, when μ is small, is approximately given by

$$\begin{array}{rcl}{ \xi }_{\mathrm{exc}}& \approx & \mu \sqrt{\frac{\pi {\xi }_{\mathrm{min } } } {2}} \ \sum\limits_{i=0}^{N}{\lambda }_{ i} \\ & =& \mu \sqrt{\frac{\pi {\xi }_{\mathrm{min } } } {2}} \ \mathrm{tr}[\bf{R}] \end{array}$$
(4.26)

By comparing the excess MSE predicted by the above equation with the corresponding (3.49) for the LMS algorithm, it can be concluded that both can generate the same excess MSE if μ in the sign-error algorithm is chosen such that

$$\mu = {\mu }_{\mathrm{LMS}}\sqrt{ \frac{2} {\pi }{\xi }_{\mathrm{min}}^{-1}}$$
(4.27)

The misadjustment in the sign-error algorithm is

$$M = \mu \sqrt{ \frac{\pi } {2{\xi }_{\mathrm{min}}}}\ \mathrm{tr}[\bf{R}]$$
(4.28)

Equation (4.26) would leave the impression that if there is no additional noise and there are sufficient parameters in the adaptive filter, the output MSE would converge to zero. However, when ξ(k) becomes small, ||E[Δw(k + 1)]|| in (4.11) can increase, since the condition of (4.13) will not be satisfied. This is the situation where the parameters reach the convergence balloon. In this case, from (4.8) we can conclude that

$$\vert \vert \Delta {\bf {w}}(k + 1)\vert {\vert }^{2} -\vert \vert \Delta {\bf {w}}(k)\vert {\vert }^{2} = -4\mu \ \mathrm{sgn}[e(k)]\ e(k) + 4{\mu }^{2}\vert \vert \bf{x}(k)\vert {\vert }^{2}$$
(4.29)

from where it is possible to show that a decrease in the norm of Δw(k) is obtained only when

$$\vert e(k)\vert > \mu \vert \vert \bf{x}(k)\vert {\vert }^{2}$$
(4.30)

For no additional noise, first transpose the vectors in (4.8) and postmultiply each side by x(k). Next, by squaring the resulting equation and applying the expected value operation on each side, we obtain

$$E[{e}^{2}(k + 1)] = E[{e}^{2}(k)] - 4\mu E[\vert e(k)\vert \ \vert \vert \bf{x}(k)\vert {\vert }^{2}] + 4{\mu }^{2}E[\vert \vert \bf{x}(k)\vert {\vert }^{4}]$$
(4.31)

After convergence E[e 2(k + 1)] ≈ E[e 2(k)]. Also, considering that

$$\begin{array}{rcl} E[\vert e(k)\vert \ \vert \vert \bf{x}(k)\vert {\vert }^{2}] \approx E[\vert e(k)\vert ]E[\vert \vert \bf{x}(k)\vert {\vert }^{2}]& & \\ \end{array}$$

and

$$\begin{array}{rcl} \frac{E[\vert \vert \bf{x}(k)\vert {\vert }^{4}]} {E[\vert \vert \bf{x}(k)\vert {\vert }^{2}]} \approx E[\vert \vert \bf{x}(k)\vert {\vert }^{2}]& & \\ \end{array}$$

we conclude that

$$E[\vert e(k)\vert ] \approx \mu E[\vert \vert \bf{x}(k)\vert {\vert }^{2}],k \rightarrow \infty $$
(4.32)

For zero-mean Gaussian e(k), the following approximation is valid

$$E[\vert e(k)\vert ] \approx \sqrt{ \frac{2} {\pi }}{\sigma }_{e}(k),k \rightarrow \infty $$
(4.33)

therefore, the expected variance of e(k) is

$${\sigma }_{e}^{2}(k) \approx \frac{\pi } {2} {\mu }^{2}\ {\mathrm{tr}}^{2}[\bf{R}],k \rightarrow \infty $$
(4.34)

where we used the relation tr[R] = E[ | | x(k) | | 2]. This relation gives an estimate of the variance of the output error when no additional noise exists. As can be noted, unlike the LMS algorithm, there is an excess MSE in the sign-error algorithm caused by the nonlinear device, even when σ n 2 = 0.

If n(k) frequently has large absolute values as compared to − Δw T(k)x(k), then for most iterations sgn[e(k)] = sgn[n(k)]. As a result, the sign-error algorithm is fully controlled by the additional noise. In this case, the algorithm does not converge.

4.2.1.4 Transient Behavior

The ratios \({r}_{{w}_{i}}\) of the geometrically decaying convergence curves of the coefficients in the sign-error algorithm can be derived from (4.12) by employing an analysis identical to that of the transient behavior of the LMS algorithm. The ratios are given by

$${r}_{{w}_{i}} = \left (1 - 2\mu \sqrt{ \frac{2} {\pi \xi (k)}}{\lambda }_{i}\right )$$
(4.35)

for i = 0, 1, …, N. If μ is chosen as suggested in (4.27), in order to reach the same excess MSE of the LMS algorithm, then

$${r}_{{w}_{i}} = \left (1 - \frac{4} {\pi }{\mu }_{\mathrm{LMS}}\sqrt{\frac{{\xi }_{\mathrm{min } } } {\xi (k)}}\ {\lambda }_{i}\right )$$
(4.36)

By recalling that \({r}_{{w}_{i}}\) for the LMS algorithm is (1 − 2μLMSλ i ), since \(\frac{2} {\pi }\sqrt{\frac{{\xi }_{\mathrm{min } } } {\xi (k)}} < 1\), it is concluded that the sign-error algorithm is slower than the LMS for the same excess MSE.

Example 4.1.

Suppose in an adaptive-filtering environment that the input signal consists of

$$x(k) ={ \mathrm{e}}^{\mathrm{J}{\omega }_{0}k} + n(k)$$

and that the desired signal is given by

$$d(k) ={ \mathrm{e}}^{\mathrm{J}{\omega }_{0}(k-1)}$$

where n(k) is a uniformly distributed white noise with variance σ n 2 = 0.1 and \({\omega }_{0} = \frac{2\pi } {M}\). In this case M = 8.

Compute the input signal correlation matrix for a first-order adaptive filter. Calculate the value of μmax for the sign-error algorithm.

Solution.

The input signal correlation matrix for this example can be calculated as shown below:

$$\begin{array}{rcl} \bf{R} = \left [\begin{array}{cc} 1 + {\sigma }_{n}^{2} & {\mathrm{e}}^{\mathrm{J}{\omega }_{0}} \\ {\mathrm{e}}^{-\mathrm{J}{\omega }_{0}} & 1 + {\sigma }_{n}^{2}\\ \end{array} \right ]& & \\ \end{array}$$

Since in this case tr[R] = 2.2 and ξmin = 0.1, we have

$$\begin{array}{rcl}{ \xi }_{\mathrm{exc}} \approx \mu \sqrt{\frac{\pi {\xi }_{\mathrm{min } } } {2}} \ \mathrm{tr}[\bf{R}] = 0.87\mu & & \\ \end{array}$$

The range of values of the convergence factor is given by

$$\begin{array}{rcl} 0 < \mu < \frac{1} {2\mathrm{tr}[\bf{R}]}\sqrt{\frac{\pi ({\xi }_{\mathrm{min } } + {\xi }_{\mathrm{exc } } )} {2}} & & \\ \end{array}$$

From the above expression, it is straightforward to calculate the upper bound for the convergence factor that is given by

$$\begin{array}{rcl}{ \mu }_{\mathrm{max}} \approx 0.132& & \\ \end{array}$$

 □ 
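The numbers above can be verified with a short script; the fixed-point iteration below solves the upper bound of the last expression with ξ(k) replaced by ξmin + ξexc, under the same assumptions used in the example.

```python
import numpy as np

sigma_n2 = 0.1                  # noise variance, equal to xi_min here
tr_R = 2 * (1 + sigma_n2)       # trace of the 2x2 correlation matrix = 2.2

# Excess MSE from (4.26): xi_exc ~= mu * sqrt(pi*xi_min/2) * tr[R]
coeff = np.sqrt(np.pi * sigma_n2 / 2) * tr_R
print(f"xi_exc ~= {coeff:.2f} * mu")           # prints ~0.87 * mu

# Upper bound with xi = xi_min + xi_exc, solved by fixed-point iteration.
mu = 0.1
for _ in range(100):
    xi = sigma_n2 + coeff * mu
    mu = np.sqrt(np.pi * xi / 2) / (2 * tr_R)
print(f"mu_max ~= {mu:.3f}")                   # prints ~0.132
```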

4.2.2 Dual-Sign Algorithm

The dual-sign algorithm attempts to perform large corrections to the coefficient vector when the modulus of the error signal is larger than a prescribed level. The basic motivation to use the dual-sign algorithm is to avoid the slow convergence inherent to the sign-error algorithm that is caused by replacing e(k) by sgn[e(k)] when | e(k) | is large.

The quantization function for the dual-sign algorithm is given by

$$\begin{array}{rcl} \mathrm{ds}[a] = \left \{\begin{array}{l@{\quad }l} \epsilon \ \mathrm{sgn}[a],\quad &\vert a\vert > \rho \\ \mathrm{sgn } [a], \quad &\vert a\vert \leq \rho \\ \quad \end{array} \right.& &\end{array}$$
(4.37)

where ε > 1 is a power of two. The dual-sign algorithm utilizes the function described above as the error quantizer, and the coefficient updating is performed as

$${\bf {w}}(k + 1) = {\bf {w}}(k) + 2\mu \ \mathrm{ds}[e(k)]\bf{x}(k)$$
(4.38)

The objective function that is minimized by the dual-sign algorithm is given by

$$\begin{array}{rcl} F[e(k)] = \left \{\begin{array}{l@{\quad }l} 2\epsilon \vert e(k)\vert - 2\rho (\epsilon - 1),\quad &\vert e(k)\vert > \rho \\ 2\vert e(k)\vert, \quad &\vert e(k)\vert \leq \rho \\ \quad \end{array} \right.& &\end{array}$$
(4.39)

where the constant 2ρ(ε − 1) was included in the objective function to make it continuous. Obviously the corresponding gradient-based update term is 2μ ds[e(k)]x(k), except at points where ds[e(k)] is nondifferentiable [6].

The same analysis procedure used for the sign-error algorithm can be applied to the dual-sign algorithm except for the fact that the quantization function is now different. The alternative quantization leads to particular expectations of nonlinear functions whose solutions are not presented here. The interested reader should refer to the work of Mathews [7]. The choices of ε and ρ determine the convergence behavior of the dual-sign algorithm [7]; typically, a large ε tends to increase both convergence speed and excess MSE, whereas a large ρ tends to reduce both the convergence speed and the excess MSE. If lim k → ∞ ξ(k) ≪ ρ2, the excess MSE of the dual-sign algorithm is approximately equal to the one given by (4.26) for the sign-error algorithm [7], since in this case | e(k) | is usually much smaller than ρ. For a given MSE in steady state, the dual-sign algorithm is expected to converge faster than the sign-error algorithm.
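A minimal sketch of the dual-sign quantizer (4.37) is given below; the values of ε (a power of two greater than one) and ρ are arbitrary illustrative choices.

```python
import numpy as np

def dual_sign(a, eps=4.0, rho=0.1):
    """Dual-sign quantizer ds[a] of (4.37): eps*sgn[a] if |a| > rho, else sgn[a]."""
    return (eps if abs(a) > rho else 1.0) * np.sign(a)

# In the coefficient update (4.38), ds[e(k)] simply replaces sgn[e(k)]:
# w = w + 2 * mu * dual_sign(e_k) * x_buf
```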

4.2.3 Power-of-Two Error Algorithm

The power-of-two error algorithm applies to the error signal a quantization defined by

$$\begin{array}{rcl} \mathrm{pe}[b] = \left \{\begin{array}{l@{\quad }l} \mathrm{sgn}[b], \quad &\vert b\vert \geq 1 \\ {2}^{\mathrm{floor}[{\log }_{2}\vert b\vert ]}\ \mathrm{sgn}[b],\quad &{2}^{-{b}_{d}+1} \leq \vert b\vert < 1 \\ \tau \ \mathrm{sgn}[b], \quad &\vert b\vert < {2}^{-{b}_{d}+1} \\ \quad \end{array} \right.& &\end{array}$$
(4.40)

where floor[ ⋅] denotes the largest integer smaller than or equal to [ ⋅], b d is the data wordlength excluding the sign bit, and τ is usually 0 or \({2}^{-{b}_{d}}\).

The coefficient updating for the power-of-two error algorithm is given by

$${\bf {w}}(k + 1) = {\bf {w}}(k) + 2\mu \ \mathrm{pe}[e(k)]\bf{x}(k)$$
(4.41)

For \(\tau = {2}^{-{b}_{d}}\), the additional noise and the convergence factor can be arbitrarily small and the algorithm will not stop updating. For τ = 0, when \(\vert e(k)\vert < {2}^{-{b}_{d}+1}\) the algorithm reaches the so-called dead zone, where the algorithm stops updating if | e(k) | is smaller than \({2}^{-{b}_{d}+1}\) most of the time [4, 8].
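A sketch of the power-of-two quantizer (4.40) follows; the data wordlength b d is a hypothetical choice, and the default τ = 0 reproduces the dead-zone behavior described above.

```python
import numpy as np

def power_of_two_error(b, bd=8, tau=0.0):
    """Power-of-two quantizer pe[b] of (4.40); bd excludes the sign bit."""
    mag = abs(b)
    if mag >= 1.0:
        return np.sign(b)
    if mag >= 2.0 ** (-bd + 1):
        return 2.0 ** np.floor(np.log2(mag)) * np.sign(b)
    return tau * np.sign(b)      # dead zone when tau = 0

# As before, pe[e(k)] replaces e(k) in the coefficient update (4.41):
# w = w + 2 * mu * power_of_two_error(e_k) * x_buf
```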

A simplified and somewhat accurate analysis of this algorithm can be performed by approximating the function pe[e(k)] by a straight line passing through the center of each quantization step. In this case, the quantizer characteristics can be approximated by \(\mathrm{pe}[e(k)] \approx \frac{2} {3}e(k)\) as illustrated in Fig. 4.2. Using this approximation, the algorithm analysis can be performed exactly in the same way as the LMS algorithm. The results for the power-of-two error algorithm can be obtained from the results for the LMS algorithm, by replacing μ by \(\frac{2} {3}\mu \). It should be mentioned that such results are only approximate, and more accurate ones can be found in [8].

Fig. 4.2
figure 2

Transfer characteristic of a quantizer with 3 bits and τ = 0

4.2.4 Sign-Data Algorithm

The algorithms discussed in this subsection cannot be considered as quantized error algorithms, but since they were proposed with similar motivation we decided to introduce them here. An alternative way to simplify the computational burden of the LMS algorithm is to apply quantization to the data vector x(k). One possible quantization scheme is to apply the sign function to the input signals, giving rise to the sign-data algorithm whose coefficient updating is performed as

$${\bf {w}}(k + 1) = {\bf {w}}(k) + 2\mu e(k)\ \mathrm{sgn}[\bf{x}(k)]$$
(4.42)

where the sign operation is applied to each element of the input vector.

The quantization of the data vector can lead to a decrease in the convergence speed and possibly to divergence. In the LMS algorithm, the average gradient direction follows the true gradient direction (or steepest-descent direction), whereas in the sign-data algorithm only a discrete set of directions can be followed. The limitation in the gradient direction followed by the sign-data algorithm may cause updates that result in frequent increases in the squared error, leading to instability. Therefore, it is relatively easy to find inputs for which the LMS algorithm converges while the sign-data algorithm diverges [6, 9]. It should be mentioned, however, that the sign-data algorithm is stable for Gaussian inputs, and, as such, has been found useful in certain applications.

Another related algorithm is the sign-sign algorithm that has very simple implementation. The coefficient updating in this case is given by

$${\bf {w}}(k + 1) = {\bf {w}}(k) + 2\mu \ \mathrm{sgn}[e(k)]\ \mathrm{sgn}[\bf{x}(k)]$$
(4.43)

The sign-sign algorithm also presents limitations similar to those of the sign-data algorithm.
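For reference, the two updates (4.42) and (4.43) differ from the earlier sketches only in where the sign operation is applied; w, x_buf, e_k, and mu denote the same hypothetical quantities used in the previous code fragments.

```python
import numpy as np

def sign_data_update(w, x_buf, e_k, mu):
    """Sign-data update (4.42): the sign is applied to the data vector."""
    return w + 2 * mu * e_k * np.sign(x_buf)

def sign_sign_update(w, x_buf, e_k, mu):
    """Sign-sign update (4.43): the sign is applied to both error and data."""
    return w + 2 * mu * np.sign(e_k) * np.sign(x_buf)
```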

4.3 The LMS-Newton Algorithm

In this section, the LMS-Newton algorithm incorporating estimates of the second-order statistics of the environment signals is introduced. The objective of the algorithm is to avoid the slow convergence of the LMS algorithm when the input signal is highly correlated. The improvement in the convergence rate is achieved at the expense of an increased computational complexity.

Nonrecursive realization of the adaptive filter leads to an MSE surface that is a quadratic function of the filter coefficients. For the direct-form FIR structure, the MSE can be described by

$$\begin{array}{rcl} \xi (k + 1)& =& \xi (k) +{{ \bf{g}}_{{\bf {w}}}}^{T}(k)\left [{\bf {w}}(k + 1) -{\bf {w}}(k)\right ] \\ & & +{\left [{\bf {w}}(k + 1) -{\bf {w}}(k)\right ]}^{T}\bf{R}\left [{\bf {w}}(k + 1) -{\bf {w}}(k)\right ]\end{array}$$
(4.44)

ξ(k) represents the MSE when the adaptive-filter coefficients are fixed at w(k) and \({\bf{g}}_{{\bf {w}}}(k) = -2\bf{p} + 2\bf{R}{\bf {w}}(k)\) is the gradient vector of the MSE surface as related to the filter coefficients at w(k). The MSE is minimized at the instant k + 1 if

$${\bf {w}}(k + 1) = {\bf {w}}(k) -\frac{1} {2}{\bf{R}}^{-1}{\bf{g}}_{{\bf {w}}}(k)$$
(4.45)

This equation is the updating formula of the Newton method. Note that in the ideal case, where matrix R and gradient vector g w (k) are known precisely, \({\bf {w}}(k + 1) ={ \bf{R}}^{-1}\bf{p} ={ {\bf {w}}}_{o}\). Therefore, the Newton method converges to the optimal solution in a single iteration, as expected for a quadratic objective function.

In practice, only estimates of the autocorrelation matrix R and of the gradient vector are available. These estimates can be applied to the Newton updating formula in order to derive a Newton-like method given by

$${\bf {w}}(k + 1) = {\bf {w}}(k) - \mu {\hat{\bf{R}}}^{-1}(k){\hat{\bf{g}}}_{{\bf {w}}}(k)$$
(4.46)

The convergence factor μ is introduced so that the algorithm can be protected from divergence, originated by the use of noisy estimates of R and g w (k).

For stationary input signals, an unbiased estimate of R is

$$\begin{array}{rcl} \hat{\bf{R}}(k)& =& \frac{1} {k + 1}\sum\limits_{i=0}^{k}\bf{x}(i){\bf{x}}^{T}(i) \\ & =& \frac{k} {k + 1}\hat{\bf{R}}(k - 1) + \frac{1} {k + 1}\bf{x}(k){\bf{x}}^{T}(k)\end{array}$$
(4.47)

since

$$\begin{array}{rcl} E[\hat{\bf{R}}(k)]& =& \frac{1} {k + 1}\sum\limits_{i=0}^{k}E[\bf{x}(i){\bf{x}}^{T}(i)] \\ & =& \bf{R} \end{array}$$
(4.48)

However, this is not a practical estimate for R, since for large k any change on the input signal statistics would be disregarded due to the infinite memory of the estimation algorithm.

Another form to estimate the autocorrelation matrix can be generated by employing a weighted summation as follows:

$$\begin{array}{rcl} \hat{\bf{R}}(k)& =& \alpha \bf{x}(k){\bf{x}}^{T}(k) + (1 - \alpha )\hat{\bf{R}}(k - 1) \\ & =& \alpha \bf{x}(k){\bf{x}}^{T}(k) + \alpha \sum\limits_{i=0}^{k-1}{(1 - \alpha )}^{k-i}\bf{x}(i){\bf{x}}^{T}(i)\end{array}$$
(4.49)

where in practice, α is a small factor chosen in the range 0 < α ≤ 0.1. This range of values of α allows a good balance between the present and past input signal information. By taking the expected value on both sides of the above equation and assuming that k → ∞, it follows that

$$\begin{array}{rcl} E[\hat{\bf{R}}(k)]& =& \alpha \sum\limits_{i=0}^{k}{(1 - \alpha )}^{k-i}E[\bf{x}(i){\bf{x}}^{T}(i)] \\ & =& \bf{R}\:\:\:\:\:\:\:\:k \rightarrow \infty \end{array}$$
(4.50)

Therefore, the estimate of R of (4.49) is unbiased.

In order to avoid inverting \(\hat{\bf{R}}(k)\), which is required by the Newton-like algorithm, we can use the so-called matrix inversion lemma given by

$${[\bf{A} + \bf{B}\bf{C}\bf{D}]}^{-1} ={ \bf{A}}^{-1} -{\bf{A}}^{-1}\bf{B}{[{\bf{D}\bf{A}}^{-1}\bf{B} +{ \bf{C}}^{-1}]}^{-1}{\bf{D}\bf{A}}^{-1}$$
(4.51)

where A, B, C and D are matrices of appropriate dimensions, and A and C are nonsingular. The above relation can be proved by simply showing that the result of premultiplying the expression on the right-hand side by A + BCD is the identity matrix (see problem 21). If we choose \(\bf{A} = (1 - \alpha )\) \(\hat{\bf{R}}(k - 1)\), \(\bf{B} ={ \bf{D}}^{T} = \bf{x}(k)\), and C = α, it can be shown that

$${ \hat{\bf{R}}}^{-1}(k) = \frac{1} {1 - \alpha }\left [{\hat{\bf{R}}}^{-1}(k - 1) -\frac{{\hat{\bf{R}}}^{-1}(k - 1)\bf{x}(k){\bf{x}}^{T}(k){\hat{\bf{R}}}^{-1}(k - 1)} {\frac{1-\alpha } {\alpha } +{ \bf{x}}^{T}(k){\hat{\bf{R}}}^{-1}(k - 1)\bf{x}(k)} \right ]$$
(4.52)

The resulting equation to calculate \(\hat{{\bf{R}}}^{-1}(k)\) is less complex to update (of order N 2 multiplications) than the direct inversion of \(\hat{\bf{R}}(k)\) at every iteration (of order N 3 multiplications).

If the estimate for the gradient vector used in the LMS algorithm is applied in (4.46), the following coefficient updating formula for the LMS-Newton algorithm results

$${\bf {w}}(k + 1) = {\bf {w}}(k) + 2\:\mu \:e(k)\:{\hat{\bf{R}}}^{-1}(k)\bf{x}(k)$$
(4.53)

The complete LMS-Newton algorithm is outlined in Algorithm 4.2. It should be noticed that alternative initialization procedures to the one presented in Algorithm 4.2 are possible.
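A minimal Python sketch of the LMS-Newton recursion, combining the inverse-matrix update (4.52) with the coefficient update (4.53), is shown below; the initialization of \({\hat{\bf{R}}}^{-1}(0)\) as a scaled identity is one possible choice among the alternatives mentioned above, and the values of μ, α, and the test signals are hypothetical.

```python
import numpy as np

def lms_newton(x, d, num_taps, mu=0.05, alpha=0.05, delta=1.0):
    """LMS-Newton: R^{-1} updated via the matrix inversion lemma (4.52),
    coefficients updated via (4.53)."""
    w = np.zeros(num_taps)
    x_buf = np.zeros(num_taps)
    R_inv = delta * np.eye(num_taps)         # initial estimate of R^{-1}
    e = np.zeros(len(x))
    for k in range(len(x)):
        x_buf = np.concatenate(([x[k]], x_buf[:-1]))
        e[k] = d[k] - w @ x_buf
        Rx = R_inv @ x_buf                   # R^{-1}(k-1) x(k)
        denom = (1 - alpha) / alpha + x_buf @ Rx
        R_inv = (R_inv - np.outer(Rx, Rx) / denom) / (1 - alpha)   # (4.52)
        w = w + 2 * mu * e[k] * (R_inv @ x_buf)                    # (4.53)
    return w, e

# Hypothetical usage with a correlated input signal.
rng = np.random.default_rng(1)
x = np.convolve(rng.standard_normal(3000), [1.0, 0.9])[:3000]
w_o = np.array([0.4, 0.2, -0.1])
d = np.convolve(x, w_o)[:len(x)] + 0.01 * rng.standard_normal(len(x))
w, e = lms_newton(x, d, num_taps=3)
print(w)                                     # should approach w_o
```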

As previously mentioned, the LMS gradient direction has the tendency to approach the ideal gradient direction. Similarly, the vector resulting from multiplying the LMS gradient direction by \({\hat{\bf{R}}}^{-1}(k)\) tends to approach the Newton direction. Therefore, we can conclude that the LMS-Newton algorithm converges along a more direct path to the minimum of the MSE surface. It can also be shown that the convergence characteristics of the algorithm are independent of the eigenvalue spread of R.

The LMS-Newton algorithm is mathematically identical to the recursive least-squares (RLS) algorithm if the forgetting factor (λ) in the latter is chosen such that \(2\mu = \alpha = 1 - \lambda \) [41]. Since a complete discussion of the RLS algorithm is given later, no further discussion of the LMS-Newton algorithm is included here.

4.4 The Normalized LMS Algorithm

If one wishes to increase the convergence speed of the LMS algorithm without using estimates of the input signal correlation matrix, a variable convergence factor is a natural solution. The normalized LMS algorithm usually converges faster than the LMS algorithm, since it utilizes a variable convergence factor aiming at the minimization of the instantaneous output error.

The updating equation of the LMS algorithm can employ a variable convergence factor μ k in order to improve the convergence rate. In this case, the updating formula is expressed as

$${\bf {w}}(k + 1) = {\bf {w}}(k) + 2{\mu }_{k}e(k)\bf{x}(k) = {\bf {w}}(k) + \Delta \tilde{{\bf {w}}}(k)$$
(4.54)

where μ k must be chosen with the objective of achieving a faster convergence. A possible strategy is to reduce the instantaneous squared error as much as possible. The motivation behind this strategy is that the instantaneous squared error is a good and simple estimate of the MSE.

The instantaneous squared error is given by

$${e}^{2}(k) = {d}^{2}(k) +{ {\bf {w}}}^{T}(k)\bf{x}(k){\bf{x}}^{T}(k){\bf {w}}(k) - 2d(k){{\bf {w}}}^{T}(k)\bf{x}(k)$$
(4.55)

If a change given by \(\tilde{{\bf {w}}}(k) = {\bf {w}}(k) + \Delta \tilde{{\bf {w}}}(k)\) is performed in the weight vector, the corresponding squared error can be shown to be

$$\begin{array}{rcl} \tilde{{e}}^{2}(k)& =& {e}^{2}(k) + 2\Delta {\tilde{{\bf {w}}}}^{T}(k)\bf{x}(k){\bf{x}}^{T}(k){\bf {w}}(k) + \Delta {\tilde{{\bf {w}}}}^{T}(k)\bf{x}(k){\bf{x}}^{T}(k)\Delta \tilde{{\bf {w}}}(k) \\ & & -2d(k)\Delta {\tilde{{\bf {w}}}}^{T}(k)\bf{x}(k) \\ & & \end{array}$$
(4.56)

It then follows that

$$\begin{array}{rcl} \Delta {e}^{2}(k)& \stackrel{\bigtriangleup }{=}& \tilde{{e}}^{2}(k) - {e}^{2}(k) \\ & = & -2\Delta {\tilde{{\bf {w}}}}^{T}(k)\bf{x}(k)e(k) + \Delta {\tilde{{\bf {w}}}}^{T}(k)\bf{x}(k){\bf{x}}^{T}(k)\Delta \tilde{{\bf {w}}}(k)\end{array}$$
(4.57)

In order to increase the convergence rate, the objective is to make Δe 2(k) negative and minimum by appropriately choosing μ k .

By replacing \(\Delta \tilde{{\bf {w}}}(k) = 2{\mu }_{k}e(k)\bf{x}(k)\) in (4.57), it follows that

$$\Delta {e}^{2}(k) = -4{\mu }_{ k}{e}^{2}(k){\bf{x}}^{T}(k)\bf{x}(k) + 4{\mu }_{ k}^{2}{e}^{2}(k){[{\bf{x}}^{T}(k)\bf{x}(k)]}^{2}$$
(4.58)

The value of μ k such that \(\frac{\partial \Delta {e}^{2}(k)} {\partial {\mu }_{k}} = 0\) is given by

$${\mu }_{k} = \frac{1} {2{\bf{x}}^{T}(k)\bf{x}(k)}$$
(4.59)

This value of μ k leads to a negative value of Δe 2(k), and, therefore, it corresponds to a minimum point of Δe 2(k).

Using this variable convergence factor, the updating equation for the LMS algorithm is then given by

$${\bf {w}}(k + 1) = {\bf {w}}(k) + \frac{e(k)\bf{x}(k)} {{\bf{x}}^{T}(k)\bf{x}(k)}$$
(4.60)

Usually a fixed convergence factor μ n is introduced in the updating formula in order to control the misadjustment, since all the derivations are based on instantaneous values of the squared errors and not on the MSE. Also a parameter γ should be included, in order to avoid large step sizes when x T(k)x(k) becomes small. The coefficient updating equation is then given by

$${\bf {w}}(k + 1) = {\bf {w}}(k) + \frac{{\mu }_{n}} {\gamma +{ \bf{x}}^{T}(k)\bf{x}(k)}\:e(k)\:\bf{x}(k)$$
(4.61)

The resulting algorithm is called the normalized LMS algorithm, and is summarized in Algorithm 4.3.

The range of values of μ n to guarantee stability can be derived by first considering that E[x T(k)x(k)] = tr[R] and that

$$\begin{array}{rcl} E\left [ \frac{e(k)\bf{x}(k)} {{\bf{x}}^{T}(k)\bf{x}(k)}\right ] \approx \frac{E[e(k)\bf{x}(k)]} {E[{\bf{x}}^{T}(k)\bf{x}(k)]}& & \\ \end{array}$$

Next, consider that the average value of the convergence factor actually applied to the LMS direction 2e(k)x(k) is \(\frac{{\mu }_{n}} {2\:\mathrm{tr}[\bf{R}]}\). Finally, by comparing the updating formula of the standard LMS algorithm with that of the normalized LMS algorithm, the desired upper bound result follows:

$$0 < \mu = \frac{{\mu }_{n}} {2\:\mathrm{tr}[\bf{R}]} < \frac{1} {\mathrm{tr}[\bf{R}]}$$
(4.62)

or 0 < μ n  < 2. In practice the convergence factor is chosen in the range 0 < μ n  ≤ 1.
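A minimal sketch of the normalized LMS update (4.61) is given below; the values of μ n and γ are illustrative, with μ n inside the practical range given above.

```python
import numpy as np

def nlms(x, d, num_taps, mu_n=0.5, gamma=1e-6):
    """Normalized LMS: w(k+1) = w(k) + mu_n/(gamma + x^T x) * e(k) * x(k)."""
    w = np.zeros(num_taps)
    x_buf = np.zeros(num_taps)
    e = np.zeros(len(x))
    for k in range(len(x)):
        x_buf = np.concatenate(([x[k]], x_buf[:-1]))
        e[k] = d[k] - w @ x_buf
        w = w + (mu_n / (gamma + x_buf @ x_buf)) * e[k] * x_buf   # (4.61)
    return w, e
```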

4.5 The Transform-Domain LMS Algorithm

The transform-domain LMS algorithm is another technique to increase the convergence speed of the LMS algorithm when the input signal is highly correlated. The basic idea behind this methodology is to modify the input signal to be applied to the adaptive filter such that the condition number of the corresponding correlation matrix is improved.

In the transform-domain LMS algorithm, the input signal vector x(k) is transformed into a more convenient vector s(k), by applying an orthonormal (or unitary) transform [10–12], i.e.,

$$\bf{s}(k) = \bf{T}\bf{x}(k)$$
(4.63)

where TT T = I. The MSE surface related to the direct-form implementation of the FIR adaptive filter can be described by

$$\xi (k) = {\xi }_{\mathrm{min}} + \Delta {{\bf {w}}}^{T}(k)\bf{R}\Delta {\bf {w}}(k)$$
(4.64)

where \(\Delta {\bf {w}}(k) = {\bf {w}}(k) -{{\bf {w}}}_{o}\). In the transform-domain case, the MSE surface becomes

$$\begin{array}{rcl} \xi (k)& =& {\xi }_{\mathrm{min}} + \Delta {\hat{{\bf {w}}}}^{T}(k)E[\bf{s}(k){\bf{s}}^{T}(k)]\Delta \hat{{\bf {w}}}(k) \\ & =& {\xi }_{\mathrm{min}} + \Delta {\hat{{\bf {w}}}}^{T}(k)\bf{T}\bf{R}{\bf{T}}^{T}\Delta \hat{{\bf {w}}}(k) \end{array}$$
(4.65)

where \(\hat{{\bf {w}}}(k)\) represents the adaptive coefficients of the transform-domain filter. Fig. 4.3 depicts the transform-domain adaptive filter.

Fig. 4.3
figure 3

Transform-domain adaptive filter

The effect of applying the transformation matrix T to the input signal is to rotate the error surface as illustrated in the numerical examples of Figs. 4.4 and 4.5. It can be noticed that the eccentricity of the MSE surface remains unchanged by the application of the transformation, and, therefore, the eigenvalue spread is unaffected by the transformation. As a consequence, no improvement in the convergence rate is expected to occur. However, if in addition each element of the transform output is power normalized, the distances between the points where the equal-error contours (given by the ellipses) meet the coefficient axes (\(\Delta \hat{{w}}_{0}\) and \(\Delta \hat{{w}}_{1}\)) and the origin (point 0 × 0) are equalized. As a result, a reduction in the eigenvalue spread is expected, especially when the coefficient axes are almost aligned with the principal axes of the ellipses. Fig. 4.6 illustrates the effect of power normalization. With perfect alignment and power normalization, the error surface becomes a hyperparaboloid with spherical contours, and the eigenvalue spread becomes equal to one. Alternatively, it means that the transform was able to make the elements of the vector s(k) uncorrelated. Fig. 4.7 shows another error surface which, after being properly rotated and normalized, is transformed into the error surface of Fig. 4.8.

Fig. 4.4
figure 4

\(\bf{R} = \left [\begin{array}{*{10}c} 1 &-0.9\\ -0.9 & 1\end{array} \right ]\) Contours of the original MSE surface

Fig. 4.5
figure 5

\( \bf{T} = \left [\begin{array}{*{10}c} \mathrm{cos}\,\theta & \mathrm{sin}\,\theta \\ \mathrm{-sin } \,\theta & \mathrm{cos } \,\theta \end{array} \right ]\ \theta = 6{0}^{\mathrm{o}}\)Rotated contours of the MSE surface

Fig. 4.6
figure 6

Contours of the power normalized MSE surface

Fig. 4.7
figure 7

\(\bf{R} = \left [\begin{array}{*{10}c} 1 & 0.92\\ 0.92 & 1\end{array} \right ]\) Contours of the original MSE surface

Fig. 4.8
figure 8

Contours of the rotated and power normalized MSE surface

The autocorrelation matrix related to the transform-domain filter is given by

$${ \bf{R}}_{s} = \bf{T}\bf{R}{\bf{T}}^{T}$$
(4.66)

therefore if the elements of s(k) are uncorrelated, matrix R s is diagonal, meaning that the application of the transformation matrix was able to diagonalize the autocorrelation matrix R. It can then be concluded that T T, in this case, corresponds to a matrix whose columns consist of the orthonormal eigenvectors of R. The resulting transformation matrix corresponds to the Karhunen-Loève Transform (KLT) [28].

The normalization of s(k) and subsequent application of the LMS algorithm would lead to a transform-domain algorithm with the limitation that the solution would be independent of the input signal power. An alternative solution, without this limitation, is to apply the normalized LMS algorithm to update the coefficients of the transform-domain algorithm. We can give an interpretation for the good performance of this solution. Assuming the transform was efficient in the rotation of the MSE surface, the variable convergence factor is large in the update of the coefficients corresponding to low signal power. On the other hand, the convergence factor is small if the corresponding transform output power is high. Specifically, the signals s i (k) are normalized by their power denoted by σ i 2(k) only when applied in the updating formula. The coefficient update equation in this case is

$$\hat{{w}}_{i}(k + 1) =\hat{ {w}}_{i}(k) + \frac{2\mu } {\gamma + {\sigma }_{i}^{2}(k)}e(k){s}_{i}(k)$$
(4.67)

where \({\sigma }_{i}^{2}(k) = \alpha {s}_{i}^{2}(k) + (1 - \alpha ){\sigma }_{i}^{2}(k - 1)\), α is a small factor chosen in the range 0 < α ≤ 0.1, and γ is also a small constant that prevents the second term of the update equation from becoming too large when σ i 2(k) is small.

In matrix form the above updating equation can be rewritten as

$$\hat{{\bf {w}}}(k + 1) = \hat{{\bf {w}}}(k) + 2\mu e(k){{\Sigma }}^{-2}(k)\bf{s}(k)$$
(4.68)

where Σ  − 2(k) is a diagonal matrix containing as elements the inverse of the power estimates of the elements of s(k) added to γ.

It can be shown that if μ is chosen appropriately, the adaptive-filter coefficients converge to

$$\hat{{{\bf {w}}}}_{o} ={{ \bf{R}}_{s}}^{-1}{\bf{p}}_{ s}$$
(4.69)

where R s  = TRT T and p s  = Tp. As a consequence, the optimum coefficient vector is

$$\hat{{{\bf {w}}}}_{o} ={ (\bf{T}\bf{R}{\bf{T}}^{T})}^{-1}\bf{T}\bf{p} = \bf{T}{\bf{R}}^{-1}\bf{p} = \bf{T}{{\bf {w}}}_{ o}$$
(4.70)

The convergence speed of the coefficient vector \(\hat{{\bf {w}}}(k)\) is determined by the eigenvalue spread of Σ  − 2(k)R s .

The requirement on the transformation matrix is that it should be invertible. If the matrix T is not square (number of columns larger than rows), the space spanned by the polynomials formed with the rows of T will be of dimension N + 1, but these polynomials are of order larger than N. This subspace does not contain the complete space of polynomials of order N. In general, except for very specific desired signals, the entire space of Nth-order polynomials would be required. For an invertible matrix T there is a one-to-one correspondence between the solutions obtained by the LMS and transform-domain LMS algorithms. Although the transformation matrix is not required to be unitary, it appears that no advantages are obtained by using nonunitary transforms [13].

The best unitary transform for the transform-domain adaptive filter is the KLT. However, since the KLT is a function of the input signal, it cannot be efficiently computed in real time. An alternative is to choose a unitary transform that is close to the KLT of the particular input signal. By close is meant that both transforms perform nearly the same rotation of the MSE surface. In any situation, the choice of an appropriate transform is not an easy task. Some guidelines can be given, such as: (a) Since the KLT of a real signal is real, the chosen transform should be real for real input signals; (b) For speech signals the discrete-time cosine transform (DCT) is a good approximation for the KLT [30]; (c) Transforms with fast algorithms should be given special attention.

A number of real transforms such as DCT, discrete-time Hartley transform, and others, are available [30]. Most of them have fast algorithms or can be implemented in recursive frequency-domain format. In particular, the outputs of the DCT are given by

$${s}_{0}(k) = \frac{1} {\sqrt{N + 1}}\sum\limits_{l=0}^{N}x(k - l)$$
(4.71)

and

$${s}_{i}(k) = \sqrt{ \frac{2} {N + 1}}\sum\limits_{l=0}^{N}x(k - l)\cos \left [\pi i \frac{(2l + 1)} {2(N + 1)}\right ]$$
(4.72)

From Fig. 4.3, we observe that the delay line and the unitary transform form a single-input and multiple-output preprocessing filter. In case the unitary transform is the DCT, the transfer function from the input to the outputs of the DCT preprocessing filter can be described in a recursive format as follows:

$${T}_{i}(z) = \frac{{k}_{0}} {N + 1}\:\cos {\tau }_{i}\:\frac{[{z}^{N+1} - {(-1)}^{i}](z - 1)} {{z}^{N}[{z}^{2} - (2\cos 2{\tau }_{i})z + 1]}$$
(4.73)

where

$$\begin{array}{rcl}{ k}_{0} = \left \{\begin{array}{ccc} \sqrt{2}&\textit{if}& i = 0\\ 2 &\textit{if} &i = 1,...,N \end{array} \right.& & \\ \end{array}$$

and \({\tau }_{i} = \frac{\pi i} {2(N+1)}\). The derivation details are not given here, since they are beyond the scope of this text.

For complex input signals, the discrete-time Fourier transform (DFT) is a natural choice due to its efficient implementations.

Although no general procedure is available to choose the best transform when the input signal is not known a priori, the decorrelation performed by the transform, followed by the power normalization, is sufficient to reduce the eigenvalue spread for a broad (not all) class of input signals. Therefore, the transform-domain LMS algorithms are expected to converge faster than the standard LMS algorithm in most applications [13].

The complete transform-domain LMS algorithm is outlined in Algorithm 4.4.
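A sketch of the transform-domain LMS recursion is given below, with the DCT outputs (4.71) and (4.72) computed by a direct matrix-vector product rather than a fast algorithm; the values of μ, γ, and α are illustrative.

```python
import numpy as np

def dct_matrix(n):
    """(N+1)x(N+1) DCT matrix whose rows implement (4.71) and (4.72), n = N+1."""
    T = np.zeros((n, n))
    T[0, :] = 1.0 / np.sqrt(n)
    for i in range(1, n):
        for l in range(n):
            T[i, l] = np.sqrt(2.0 / n) * np.cos(np.pi * i * (2 * l + 1) / (2 * n))
    return T

def tdlms(x, d, num_taps, mu=0.01, gamma=1e-6, alpha=0.05):
    """Transform-domain LMS with power normalization, as in (4.67)."""
    T = dct_matrix(num_taps)
    w_hat = np.zeros(num_taps)               # transform-domain coefficients
    x_buf = np.zeros(num_taps)
    sigma2 = np.ones(num_taps)               # power estimates of s_i(k)
    e = np.zeros(len(x))
    for k in range(len(x)):
        x_buf = np.concatenate(([x[k]], x_buf[:-1]))
        s = T @ x_buf                                        # s(k) = T x(k), (4.63)
        e[k] = d[k] - w_hat @ s
        sigma2 = alpha * s**2 + (1 - alpha) * sigma2         # running power estimate
        w_hat = w_hat + 2 * mu * e[k] * s / (gamma + sigma2) # (4.67)
    return w_hat, e   # equivalent direct-form coefficients: T.T @ w_hat, cf. (4.70)
```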

Example 4.2.

Repeat the equalization problem of example 3.1 of the previous chapter using the transform-domain LMS algorithm.

(a) Compute the Wiener solution.

(b) Choose an appropriate value for μ and plot the convergence path for the transform-domain LMS algorithm on the MSE surface.

Solution.

(a) In this example, the correlation matrix of the adaptive-filter input signal is given by

    $$\begin{array}{rcl} {\bf{R}} = \left [\begin{array}{cc} 1.6873 & - 0.7937\\ - 0.7937 & 1.6873\\ \end{array} \right ]& & \\ \end{array}$$

    and the cross-correlation vector p is

    $$\begin{array}{rcl} \bf{p} = \left [\begin{array}{c} 0.9524\\ 0.4762\\ \end{array} \right ]& & \\ \end{array}$$

    For square matrix R of dimension 2, the transformation matrix corresponding to the cosine transform is given by

    $$\begin{array}{rcl} \bf{T} = \left [\begin{array}{cc} \frac{\sqrt{2}} {2} & \frac{\sqrt{2}} {2} \\ \frac{\sqrt{2}} {2} & -\frac{\sqrt{2}} {2}\\ \end{array} \right ]& & \\ \end{array}$$

    For this filter order, the above transformation matrix coincides with the KLT.

    The coefficients corresponding to the Wiener solution of the transform-domain filter are given by

    $$\begin{array}{rcl} \hat{{{\bf {w}}}}_{o}& =& {(\bf{T}\bf{R}{\bf{T}}^{T})}^{-1}\bf{T}\bf{p} \\ & =& \left [\begin{array}{cc} \frac{1} {0.8936} & 0 \\ 0 & \frac{1} {2.4810}\\ \end{array} \right ]\left [\begin{array}{c} 1.0102\\ 0.3367\\ \end{array} \right ] \\ & =& \left [\begin{array}{c} 1.1305\\ 0.1357\\ \end{array} \right ] \\ \end{array}$$
(b) The transform-domain LMS algorithm is applied to minimize the MSE using a small convergence factor \(\mu = 1/300\), in order to obtain a smoothly converging curve. The convergence path of the algorithm in the MSE surface is depicted in Fig. 4.9. As can be noted, the transformation aligned the coefficient axes with the main axes of the ellipses belonging to the error surface. The reader should notice that the algorithm follows an almost straight path to the minimum and that the effect of the eigenvalue spread is compensated by the power normalization. The convergence in this case is faster than for the LMS case. □

From the transform-domain LMS algorithm point of view, we can consider that the LMS-Newton algorithm attempts to utilize an estimate of the KLT through \({\hat{\bf{R}}}^{-1}(k)\). On the other hand, the normalized LMS algorithm utilizes an identity transform with an instantaneous estimate of the input signal power given by x T(k)x(k).
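Returning to Example 4.2, part (a) can be checked numerically with the values of R, p, and T given above.

```python
import numpy as np

R = np.array([[1.6873, -0.7937],
              [-0.7937, 1.6873]])
p = np.array([0.9524, 0.4762])
T = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)  # 2x2 DCT, coincides with the KLT here

Rs = T @ R @ T.T                      # diagonal: diag(0.8936, 2.4810)
ps = T @ p                            # [1.0102, 0.3367]
print(np.linalg.solve(Rs, ps))        # ~[1.1305, 0.1357]
print(T @ np.linalg.solve(R, p))      # same result via (4.70): T w_o
```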

Fig. 4.9
figure 9

Convergence path of the transform-domain adaptive filter

4.6 The Affine Projection Algorithm

There are situations where it is possible to recycle the old data signal in order to improve the convergence of the adaptive-filtering algorithms. Data-reusing algorithms [18–24, 31] are considered an alternative to increase the speed of convergence in adaptive-filtering algorithms in situations where the input signal is correlated. The penalty to be paid by data reusing is increased algorithm misadjustment, and, as usual, a trade-off between final misadjustment and convergence speed is achieved through the introduction of a convergence factor.

Let’s assume we keep the last L + 1 input signal vectors in a matrix as follows:

$$\begin{array}{rcl}{ \bf{X}}_{\mathrm{ap}}(k)& =& \left [\begin{array}{ccccc} x(k) & x(k - 1) &\cdots & x(k - L + 1) & x(k - L) \\ x(k - 1) & x(k - 2) &\cdots & x(k - L) & x(k - L - 1)\\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x(k - N)&x(k - N - 1)&\cdots &x(k - L - N + 1)&x(k - L - N) \end{array} \right ] \\ & =& [\bf{x}(k)\:\bf{x}(k - 1)\ldots \bf{x}(k - L)] \end{array}$$
(4.74)

We can also define some vectors representing the partial reusing results at a given iteration k, such as the adaptive-filter output, the desired signal, and the error vectors.

These vectors are

$$\begin{array}{rcl}{ \bf{y}}_{\mathrm{ap}}(k)& =&{ \bf{X}}_{\mathrm{ap}}^{T}(k){\bf {w}}(k) = \left [\begin{array}{c} {y}_{\mathrm{ap},0}(k) \\ {y}_{\mathrm{ap},1}(k)\\ \vdots \\ {y}_{\mathrm{ap},L}(k) \end{array} \right ]& &\end{array}$$
(4.75)

$$\begin{array}{rcl}{ \bf{d}}_{\mathrm{ap}}(k)& =& \left [\begin{array}{c} d(k) \\ d(k - 1)\\ \vdots \\ d(k - L) \end{array} \right ]& &\end{array}$$
(4.76)

$$\begin{array}{rcl}{ \bf{e}}_{\mathrm{ap}}(k)& =& \left [\begin{array}{c} {e}_{\mathrm{ap},0}(k) \\ {e}_{\mathrm{ap},1}(k)\\ \vdots \\ {e}_{\mathrm{ap},L}(k) \end{array} \right ] = \left [\begin{array}{c} d(k) - {y}_{\mathrm{ap},0}(k) \\ d(k - 1) - {y}_{\mathrm{ap},1}(k)\\ \vdots \\ d(k - L) - {y}_{\mathrm{ap},L}(k) \end{array} \right ] ={ \bf{d}}_{\mathrm{ap}}(k) -{\bf{y}}_{\mathrm{ap}}(k)& &\end{array}$$
(4.77)

The objective of the affine projection algorithm is to minimize

$$\begin{array}{rcl} & & \frac{1} {2}\|{\bf {w}}(k + 1) -{\bf {w}}{(k)\|}^{2}\:\: \\ & & \mathrm{subject\:to :} \\ & &{ \bf{d}}_{\mathrm{ap}}(k) -{\bf{X}}_{\mathrm{ap}}^{T}(k){\bf {w}}(k + 1) = \bf{0}\end{array}$$
(4.78)

The affine projection algorithm maintains the next coefficient vector w(k + 1) as close as possible to the current one w(k), while forcing the a posteriori error to be zero.

Using the method of Lagrange multipliers to turn the constrained minimization into an unconstrained one, the unconstrained function to be minimized is

$$\begin{array}{rcl} F[{\bf {w}} (k + 1)] = \frac{1} {2}\|{\bf {w}}(k + 1) -{\bf {w}}{(k)\|}^{2} +{ {\lambda }}_{\mathrm{ ap}}^{T}(k)[{\bf{d}}_{\mathrm{ ap}}(k) -{\bf{X}}_{\mathrm{ap}}^{T}(k){\bf {w}}(k + 1)]& &\end{array}$$
(4.79)

where λ ap(k) is an (L + 1) ×1 vector of Lagrange multipliers. The above expression can be rewritten as

$$\begin{array}{rcl} F[{\bf {w}}(k + 1)]& =& \frac{1} {2}{\left [{\bf {w}}(k + 1) -{\bf {w}}(k)\right ]}^{T}\left [{\bf {w}}(k + 1) -{\bf {w}}(k)\right ] \\ & & +\left [{\bf{d}}_{\mathrm{ap}}^{T}(k) -{{\bf {w}}}^{T}(k + 1){\bf{X}}_{\mathrm{ ap}}(k)\right ]{{\lambda }}_{\mathrm{ap}}(k)\end{array}$$
(4.80)

The gradient of F[w(k + 1)] with respect to w(k + 1) is given by

$$\begin{array}{rcl}{ \bf{g}}_{{\bf {w}}}\left \{F[{\bf {w}}(k + 1)]\right \} = \frac{1} {2}\left [2{\bf {w}}(k + 1) - 2{\bf {w}}(k)\right ] -{\bf{X}}_{\mathrm{ap}}(k){{\lambda }}_{\mathrm{ap}}(k)& &\end{array}$$
(4.81)

After setting the gradient of F[w(k + 1)] with respect to w(k + 1) equal to zero, we get

$$\begin{array}{rcl} {\bf {w}}(k + 1) = {\bf {w}}(k) +{ \bf{X}}_{\mathrm{ap}}(k){{\lambda }}_{\mathrm{ap}}(k)& &\end{array}$$
(4.82)

If we substitute (4.82) in the constraint relation of (4.78), we obtain

$$\begin{array}{rcl}{ \bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k){{\lambda }}_{\mathrm{ap}}(k) ={ \bf{d}}_{\mathrm{ap}}(k) -{\bf{X}}_{\mathrm{ap}}^{T}(k){\bf {w}}(k) ={ \bf{e}}_{\mathrm{ ap}}(k)& &\end{array}$$
(4.83)

The update equation is now given by (4.82) with λ ap(k) being the solution of (4.83), i.e.,

$$\begin{array}{rcl} {\bf {w}}(k + 1) = {\bf {w}}(k) +{ \bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}{\bf{e}}_{\mathrm{ ap}}(k)& &\end{array}$$
(4.84)

The above algorithm corresponds to the conventional affine projection algorithm [20] with unity convergence factor. A trade-off between final misadjustment and convergence speed is achieved through the introduction of a convergence factor as follows

$$\begin{array}{rcl} {\bf {w}}(k + 1) = {\bf {w}}(k) + \mu {\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}{\bf{e}}_{\mathrm{ ap}}(k)& &\end{array}$$
(4.85)

Note that with the convergence factor the a posteriori error is no longer zero. In fact, when measurement noise is present in the environment, zeroing the a posteriori error is not a good idea since we are forcing the adaptive filter to compensate for the effect of a noise signal which is uncorrelated with the adaptive-filter input signal. The result is a high misadjustment when the convergence factor is one. The description of the affine projection algorithm is given in Algorithm 4.5, where an identity matrix multiplied by a small constant was added to the matrix X ap T(k)X ap(k) in order to avoid numerical problems in the matrix inversion. The order of the matrix to be inverted depends on the number of data vectors being reused.
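A minimal sketch of the affine projection recursion of Algorithm 4.5 follows; the number of reuses L, the convergence factor μ, and the regularization constant δ are illustrative, and the (L + 1) × (L + 1) matrix is inverted directly since its order is small.

```python
import numpy as np

def affine_projection(x, d, num_taps, L=2, mu=0.5, delta=1e-6):
    """Affine projection algorithm, (4.85), reusing the last L+1 input vectors."""
    w = np.zeros(num_taps)
    X_ap = np.zeros((num_taps, L + 1))     # columns x(k), x(k-1), ..., x(k-L), (4.74)
    d_ap = np.zeros(L + 1)
    x_buf = np.zeros(num_taps)
    e = np.zeros(len(x))
    for k in range(len(x)):
        x_buf = np.concatenate(([x[k]], x_buf[:-1]))
        X_ap = np.column_stack((x_buf, X_ap[:, :-1]))      # shift in the newest x(k)
        d_ap = np.concatenate(([d[k]], d_ap[:-1]))
        e_ap = d_ap - X_ap.T @ w                           # (4.77)
        e[k] = e_ap[0]
        # (4.85), with delta*I added to avoid numerical problems in the inversion
        w = w + mu * X_ap @ np.linalg.solve(X_ap.T @ X_ap + delta * np.eye(L + 1), e_ap)
    return w, e
```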

Let’s define the hyperplane \(\mathcal{S}(k)\) as follows

$$\begin{array}{rcl} \mathcal{S}(k) =\{ {\bf {w}}(k + 1) \in {\mathbb{R}}^{N+1} : d(k) -{{\bf {w}}}^{T}(k + 1)\bf{x}(k) = 0\}& &\end{array}$$
(4.86)

It is noticed that the a posteriori error over this hyperplane is zero, that is, given the current input data stored in the vector x(k) the coefficients are updated to a point where the error computed with the coefficients updated is zero. This definition allows an insightful geometric interpretation for the affine projection algorithm.

In the affine projection algorithm the coefficients are computed such that they belong to an L + 1-dimensional subspace \(\in {\mathbb{R}}^{N+1}\), where \(\mathbb{R}\) represents the set of real numbers, spanned by the L + 1 columns of X ap(k). The objective of having L + 1 a posteriori errors equal to zero has an infinite number of solutions, such that any solution lying on \(\mathcal{S}(k)\) can be added to a coefficient vector lying on \({\mathcal{S}}^{\perp }(k)\). The additional minimization of \(\frac{1} {2}\|{\bf {w}}(k + 1) -{\bf {w}}{(k)\|}^{2}\) specifies the solution with minimum disturbance. The matrix X ap(k)(X ap T(k)X ap(k)) − 1 X ap T(k) represents an orthogonal projection operator on the L + 1-dimensional subspace of \({\mathbb{R}}^{N+1}\) spanned by the L + 1 columns of X ap(k). This projection matrix has L + 1 eigenvalues equal to 1 and N − L eigenvalues of value 0. On the other hand, the matrix \(\bf{I} - \mu {\bf{X}}_{\mathrm{ap}}(k){({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ap}}(k))}^{-1}{\bf{X}}_{\mathrm{ap}}^{T}(k)\) has L + 1 eigenvalues equal to 1 − μ and N − L eigenvalues of value 1.

When L = 0 and L = 1 the affine projection algorithm has the normalized LMS and binormalized LMS algorithms [22] as special cases, respectively. In the binormalized case the matrix inversion has a closed-form solution. Figure 4.10 illustrates the updating of the coefficient vector for a two-dimensional problem for the LMS algorithm, for the normalized LMS algorithm, for the normalized LMS algorithm with a single data reuse, and for the binormalized LMS algorithm. Here we assume that the coefficients are originally at \(\tilde{{\bf {w}}}\) when the new data vector x(k) becomes available and x(k − 1) is still stored, and this scenario is used to illustrate the coefficient updating of the related algorithms. In addition, an environment with no additional noise and a sufficient-order system identification problem are assumed, where the LMS algorithm utilizes a small convergence factor whereas the remaining algorithms use unit convergence factor. The conventional LMS algorithm takes a step towards \(\mathcal{S}(k)\) yielding a solution w(k + 1), anywhere between points 1 and 3 in Fig. 4.10, that is closer to \(\mathcal{S}(k)\) than \(\tilde{{\bf {w}}}\). The NLMS algorithm with unit convergence factor performs a line search in the direction of x(k) to yield in a single step the solution w(k + 1), represented by point 3 in Fig. 4.10, which belongs to \(\mathcal{S}(k)\). A single reuse of the previous data using the normalized LMS algorithm would lead to point 4. The binormalized LMS algorithm, which corresponds to an affine projection algorithm with two projections, yields the solution that belongs to \(\mathcal{S}(k - 1)\) and \(\mathcal{S}(k)\), represented by point 5 in Fig. 4.10. As an illustration, it is possible to observe in Fig. 4.11 that repeatedly re-utilizing the data vectors x(k) and x(k − 1) to update the coefficients with the normalized LMS algorithm would reach point 5 in a zig-zag pattern only after an infinite number of iterations. This approach is known as the Kaczmarz method [22].

Fig. 4.10 Coefficient vector updating for the normalized LMS algorithm and binormalized LMS algorithm

Fig. 4.11 Multiple data reuse for the normalized LMS algorithm

For a noise-free environment and sufficient-order identification problem, the optimal solution w o is at the intersection of L + 1 hyperplanes constructed with linearly independent input signal vectors. The affine projection algorithm with unit convergence factor updates the coefficient to the intersection. Figure 4.12 illustrates the coefficient updating for a three-dimensional problem for the normalized and binormalized LMS algorithms. It can be observed in Fig. 4.12 that x(k) and, consequently, g w [e 2(k)] are orthogonal to the hyperplane \(\mathcal{S}(k)\). Similarly, x(k − 1) is orthogonal to the hyperplane \(\mathcal{S}(k - 1)\). The normalized LMS algorithm moves the coefficients from point 1 to point 2, whereas the binormalized LMS algorithm updates the coefficients to point 3 at the intersection of the two hyperplanes.

Fig. 4.12 Three-dimensional coefficient vector updating for the normalized LMS algorithm and binormalized LMS algorithm

The affine projection algorithm combines data reusing, orthogonal projections of L consecutive gradient directions, and normalization in order to achieve faster convergence than many other LMS-based algorithms. At each iteration, the affine projection algorithm yields the solution w(k + 1) which is at the intersection of hyperplanes \(\mathcal{S}(k),\mathcal{S}(k - 1),\ldots,\mathcal{S}(k - L)\) and is as close as possible to w(k). The computational complexity of the affine projection algorithm is related to the number of data vectors being reused which ultimately determines the order of the matrix to be inverted. Some fast versions of the algorithm can be found in [21, 26]. It is also possible to reduce computations by employing data-selective strategies as will be discussed in Chapter 6.
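
The following sketch (an illustrative transcription, not code from the text) implements one regularized affine projection update of the form used in this section; the function name, the default values of μ and γ, and the data layout are assumptions made here for clarity. It also makes the complexity comment above concrete: the cost per update is dominated by forming and inverting the (L + 1) × (L + 1) matrix.

```python
import numpy as np

def affine_projection_update(w, X_ap, d_ap, mu=1.0, gamma=1e-6):
    """One regularized affine projection update.

    w    : current coefficient vector, shape (N+1,)
    X_ap : the L+1 most recent input vectors, shape (N+1, L+1)
    d_ap : the L+1 most recent desired samples, shape (L+1,)
    """
    e_ap = d_ap - X_ap.T @ w                               # a priori error vector
    R_reg = X_ap.T @ X_ap + gamma * np.eye(X_ap.shape[1])  # (L+1) x (L+1) matrix
    return w + mu * X_ap @ np.linalg.solve(R_reg, e_ap)
```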

4.6.1 Misadjustment in the Affine Projection Algorithm

The analysis of the affine projection algorithm is somewhat more involved than that of some other LMS-based algorithms. The following framework provides an alternative analysis approach utilizing the concept of energy conservation [32–36]. This framework has been widely used in the recent literature to analyze several adaptive-filtering algorithms [36]. In particular, the approach is very useful for analyzing the behavior of the affine projection algorithm in a rather simple manner [35].

A general adaptive-filtering algorithm utilizes the following coefficient updating form

$${\bf {w}}(k + 1) = {\bf {w}}(k) - \mu {\bf{F}}_{\bf{x}}(k){\bf{f}}_{\bf{e}}(k)$$
(4.87)

where F x (k) is a matrix whose elements are functions of the input data and f e (k) is a vector whose elements are functions of the error. Assuming that the desired signal is given by

$$d(k) ={ {\bf {w}}}_{o}^{T}\bf{x}(k) + n(k)$$
(4.88)

the underlying updating equation can be alternatively described by

$$\Delta {\bf {w}}(k + 1) = \Delta {\bf {w}}(k) - \mu {\bf{F}}_{\bf{x}}(k){\bf{f}}_{\bf{e}}(k)$$
(4.89)

where \(\Delta {\bf {w}}(k) = {\bf {w}}(k) -{{\bf {w}}}_{o}\).

In the case of the affine projection algorithm

$$\begin{array}{rcl}{ \bf{f}}_{\bf{e}}(k) = -{\bf{e}}_{\mathrm{ap}}(k)& &\end{array}$$
(4.90)

according to (4.77). By premultiplying (4.89) by the input vector matrix of (4.74), the following expressions result

$$\begin{array}{rcl}{ \bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k + 1)& =&{ \bf{X}}_{\mathrm{ ap}}^{T}(k)\Delta {\bf {w}}(k) + \mu {\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{F}}_{\bf{x}}(k){\bf{e}}_{\mathrm{ap}}(k) \\ -\tilde{{\mathbf{\epsilon }}}_{\mathrm{ap}}(k)& =& -\tilde{{\bf{e}}}_{\mathrm{ap}}(k) + \mu {\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{F}}_{\bf{x}}(k){\bf{e}}_{\mathrm{ap}}(k) \end{array}$$
(4.91)

where

$$\begin{array}{rcl} \tilde{{\mathbf{\epsilon }}}_{\mathrm{ap}}(k) = -{\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k + 1)& &\end{array}$$
(4.92)

is the noiseless a posteriori error vector and

$$\begin{array}{rcl} \tilde{{\bf{e}}}_{\mathrm{ap}}(k) = -{\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k) ={ \bf{e}}_{\mathrm{ ap}}(k) -{\mathbf{n}}_{\mathrm{ap}}(k)& &\end{array}$$
(4.93)

is the noiseless a priori error vector with

$$\begin{array}{rcl}{ \mathbf{n}}_{\mathrm{ap}}(k) = \left [\begin{array}{c} n(k) \\ n(k - 1)\\ \vdots \\ n(k - L) \end{array} \right ]& & \\ \end{array}$$

being the standard noise vector.

For the regularized affine projection algorithm

$$\begin{array}{rcl}{ \bf{F}}_{\bf{x}}(k) ={ \bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k) + \gamma \bf{I}\right )}^{-1}& & \\ \end{array}$$

where the matrix γI is added to the matrix to be inverted in order to avoid numerical problems in the inversion operation in cases where X ap T(k)X ap(k) is ill conditioned.

By solving (4.91), we get

$$\begin{array}{rcl} \frac{1} {\mu }{\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\left (\tilde{{\bf{e}}}_{\mathrm{ ap}}(k) -\tilde{{\mathbf{\epsilon }}}_{\mathrm{ap}}(k)\right ) ={ \left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k) + \gamma \bf{I}\right )}^{-1}{\bf{e}}_{\mathrm{ ap}}(k)& & \\ \end{array}$$

If we replace the above equation in

$$\Delta {\bf {w}}(k + 1) = \Delta {\bf {w}}(k) + \mu {\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k) + \gamma \bf{I}\right )}^{-1}{\bf{e}}_{\mathrm{ ap}}(k)$$
(4.94)

which corresponds to (4.89) for the affine projection case, it is possible to deduce that

$$\begin{array}{rcl} & & \Delta {\bf {w}}(k + 1) -{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k) \\ & & \quad = \Delta {\bf {w}}(k) -{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\end{array}$$
(4.95)

From the above equation it is possible to prove that

$$\begin{array}{rcl} & & E\left [\|\Delta {\bf {w}}{(k + 1)\|}^{2}\right ] + E\left [\tilde{{\bf{e}}}_{\mathrm{ ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k)\right ] \\ & & \quad = E\left [\|\Delta {\bf {w}}{(k)\|}^{2}\right ] + E\left [\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\right ]\end{array}$$
(4.96)

Proof.

One can now calculate the squared Euclidean norm of both sides of (4.95)

$$\begin{array}{rcl} & &{ \left [\Delta {\bf {w}}(k + 1) -{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k)\right ]}^{T} \\ & & \qquad \times \left [\Delta {\bf {w}}(k + 1) -{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k)\right ] \\ & & \quad ={ \left [\Delta {\bf {w}}(k) -{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\right ]}^{T} \\ & & \qquad \times \left [\Delta {\bf {w}}(k) -{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\right ] \\ \end{array}$$

By performing the inner products one by one, the above equation becomes

$$\begin{array}{rcl} & & \Delta {{\bf {w}}}^{T}(k + 1)\Delta {\bf {w}}(k + 1) - \Delta {{\bf {w}}}^{T}(k + 1){\bf{X}}_{\mathrm{ ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k) \\ & & \qquad -{\left [{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k)\right ]}^{T}\Delta {\bf {w}}(k + 1) \\ & & \qquad +{ \left [{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k)\right ]}^{T}\left [{\bf{X}}_{\mathrm{ ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k)\right ] \\ & & \quad = \Delta {{\bf {w}}}^{T}(k)\Delta {\bf {w}}(k) - \Delta {{\bf {w}}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k) \\ & & \qquad -{\left [{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\right ]}^{T}\Delta {\bf {w}}(k) \\ & & \qquad +{ \left [{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\right ]}^{T}\left [{\bf{X}}_{\mathrm{ ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\right ] \\ \end{array}$$

Since \(\tilde{{\mathbf{\epsilon }}}_{\mathrm{ap}}(k) = -{\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k + 1)\) and \(\tilde{{\bf{e}}}_{\mathrm{ap}}(k) = -{\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k)\)

$$\begin{array}{rcl} & & \|\Delta {\bf {w}}{(k + 1)\|}^{2} +\tilde{{ \mathbf{\epsilon }}}_{\mathrm{ ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k) \\ & & \quad +\tilde{{ \bf{e}}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k) +\tilde{{ \bf{e}}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k) \\ & & \quad =\| \Delta {\bf {w}}{(k)\|}^{2} +\tilde{{ \bf{e}}}_{\mathrm{ ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k) \\ & & \qquad +\tilde{{ \mathbf{\epsilon }}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k) +\tilde{{ \mathbf{\epsilon }}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k) \\ \end{array}$$

By removing the equal terms on both sides of the last equation the following equality holds

$$\begin{array}{rcl} & & \|\Delta {\bf {w}}{(k + 1)\|}^{2} +\tilde{{ \bf{e}}}_{\mathrm{ ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k) \\ & & \quad =\| \Delta {\bf {w}}{(k)\|}^{2} +\tilde{{ \mathbf{\epsilon }}}_{\mathrm{ ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\end{array}$$
(4.97)

As can be observed, no approximations were utilized so far. Now, by applying the expected value operator on both sides of the above equation, the expression of (4.96) holds. □ 
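
Since no approximations are involved up to (4.97), the identity can be checked numerically for a single regularized update, for any μ and γ. The sketch below (all data values and parameters are arbitrary choices for illustration) performs such a check.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, mu, gamma = 7, 2, 0.6, 1e-2
w_o = rng.standard_normal(N + 1)                      # "unknown" system
w_k = rng.standard_normal(N + 1)                      # current coefficients
X_ap = rng.standard_normal((N + 1, L + 1))
n_ap = 0.01 * rng.standard_normal(L + 1)              # measurement noise
d_ap = X_ap.T @ w_o + n_ap

e_ap = d_ap - X_ap.T @ w_k
S_hat = np.linalg.inv(X_ap.T @ X_ap + gamma * np.eye(L + 1))
w_k1 = w_k + mu * X_ap @ S_hat @ e_ap                 # regularized AP update

dw_k, dw_k1 = w_k - w_o, w_k1 - w_o
e_til = -X_ap.T @ dw_k                                # noiseless a priori error
eps_til = -X_ap.T @ dw_k1                             # noiseless a posteriori error
R_inv = np.linalg.inv(X_ap.T @ X_ap)

lhs = dw_k1 @ dw_k1 + e_til @ R_inv @ e_til
rhs = dw_k @ dw_k + eps_til @ R_inv @ eps_til
print(np.isclose(lhs, rhs))                           # True, up to rounding
```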

If it is assumed that the algorithm has converged, that is, that the coefficients remain on average unchanged, then \(E\left [\|\Delta {\bf {w}}{(k + 1)\|}^{2}\right ] = E\left [\|\Delta {\bf {w}}{(k)\|}^{2}\right ]\). As a result, the following equality holds in the steady state.

$$\begin{array}{rcl} E\left [\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k)\right ] = E\left [\tilde{{\mathbf{\epsilon }}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\right ]& &\end{array}$$
(4.98)

In the above expression it is useful to remove the dependence on the a posteriori error, which can be achieved by applying (4.91) to the affine projection algorithm case.

$$\begin{array}{rcl} \tilde{{\mathbf{\epsilon }}}_{\mathrm{ap}}(k) =\tilde{{ \bf{e}}}_{\mathrm{ap}}(k) - \mu {\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k) + \gamma \bf{I}\right )}^{-1}{\bf{e}}_{\mathrm{ ap}}(k)& &\end{array}$$
(4.99)

By substituting (4.99) in (4.98) we get

$$\begin{array}{rcl} E\left [\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k)\right ]& = E\left [\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k)\right. & \\ & \quad \qquad - \mu \tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ap}}(k) + \gamma \bf{I}\right )}^{-1}{\bf{e}}_{\mathrm{ap}}(k) & \\ & \quad \qquad - \mu {\bf{e}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ap}}(k) + \gamma \bf{I}\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ap}}(k) & \\ & \quad \qquad + {\mu }^{2}{\bf{e}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ap}}(k) + \gamma \bf{I}\right )}^{-1} & \\ & \quad \qquad \times \left.{\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ap}}(k) + \gamma \bf{I}\right )}^{-1}{\bf{e}}_{\mathrm{ap}}(k)\right ]&\end{array}$$
(4.100)

The above expression can be simplified as

$$\begin{array}{rcl} & & {\mu }^{2}E\left [{\bf{e}}_{\mathrm{ ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{e}}_{\mathrm{ap}}(k)\right ] \\ & & \quad = \mu E\left [\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k){\bf{e}}_{\mathrm{ap}}(k) +{ \bf{e}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\right ]\end{array}$$
(4.101)

where the following definitions are employed to simplify the discussion

$$\begin{array}{rcl} \hat{{\bf{R}}}_{\mathrm{ap}}(k)& =&{ \bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k) \\ \hat{{\mathbf{S}}}_{\mathrm{ap}}(k)& =&{ \left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k) + \gamma \bf{I}\right )}^{-1}\end{array}$$
(4.102)

By recalling the definition of the squared error in (3.39) and applying the expected value operator, we obtain

$$\begin{array}{rcl} \xi (k) = E[{e}^{2}(k)] = E[{n}^{2}(k)] - 2E[n(k)\Delta {{\bf {w}}}^{T}(k)\bf{x}(k)] + E[\Delta {{\bf {w}}}^{T}(k)\bf{x}(k){\bf{x}}^{T}(k)\Delta {\bf {w}}(k)]& & \\ & &\end{array}$$
(4.103)

If the coefficients are weakly dependent on the additional noise, then by applying the orthogonality principle we can simplify the above expression as follows

$$\begin{array}{rcl} \xi (k)& =& {\sigma }_{n}^{2} + E[\Delta {{\bf {w}}}^{T}(k)\bf{x}(k){\bf{x}}^{T}(k)\Delta {\bf {w}}(k)] \\ & =& {\sigma }_{n}^{2} + E[\tilde{{e}}_{\mathrm{ ap,0}}^{2}(k)] \end{array}$$
(4.104)

where \(\tilde{{e}}_{\mathrm{ap,0}}(k)\) is the first element of vector \(\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\).

In order to compute the excess mean-square error we need the value of \(E[\tilde{{e}}_{\mathrm{ap,0}}^{2}(k)]\), which can be extracted from (4.101). Since our aim is to compute \(E[\tilde{{e}}_{\mathrm{ap,0}}^{2}(k)]\), we can substitute (4.93) in (4.101) in order to eliminate e ap(k). The resulting expression is given by

$$\begin{array}{rcl} & & E\left [\mu {(\tilde{{\bf{e}}}_{\mathrm{ap}}(k) +{ \mathbf{n}}_{\mathrm{ap}}(k))}^{T}\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)(\tilde{{\bf{e}}}_{\mathrm{ap}}(k) +{ \mathbf{n}}_{\mathrm{ap}}(k))\right ] \\ & & \quad = E\left [\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)(\tilde{{\bf{e}}}_{\mathrm{ap}}(k) +{ \mathbf{n}}_{\mathrm{ap}}(k)) + {(\tilde{{\bf{e}}}_{\mathrm{ap}}(k) +{ \mathbf{n}}_{\mathrm{ap}}(k))}^{T}\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\right ]\end{array}$$
(4.105)

By considering the noise white and statistically independent of the input signal, the above relation can be further simplified as

$$\begin{array}{rcl} & & \mu E\left [\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}(k) +{ \mathbf{n}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\mathbf{n}}_{\mathrm{ap}}(k)\right ] \\ & & \quad = 2E\left [\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\right ] \end{array}$$
(4.106)

The above expression, after some rearrangements, can be rewritten as

$$\begin{array}{rcl} & & 2E\left \{\mathrm{tr}[\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)]\right \} - \mu E\left \{\mathrm{tr}[\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)]\right \} \\ & & \quad = \mu E\left \{\mathrm{tr}[{\mathbf{n}}_{\mathrm{ap}}(k){\mathbf{n}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)]\right \} \end{array}$$
(4.107)

where we used the property tr[AB] = tr[BA].

In addition, if matrix \(\hat{{\bf{R}}}_{\mathrm{ap}}(k)\) is invertible it can be noticed that

$$\begin{array}{rcl} \hat{{\mathbf{S}}}_{\mathrm{ap}}(k)& =&{ \left [\hat{{\bf{R}}}_{\mathrm{ap}}(k) + \gamma \bf{I}\right ]}^{-1} \\ & =& \hat{{\bf{R}}}_{\mathrm{ap}}^{-1}(k)\left [\bf{I} - \gamma \hat{{\bf{R}}}_{\mathrm{ ap}}^{-1}(k) + {\gamma }^{2}\hat{{\bf{R}}}_{\mathrm{ ap}}^{-2}(k) - {\gamma }^{3}\hat{{\bf{R}}}_{\mathrm{ ap}}^{-3}(k) + \cdots \,\right ] \\ & \approx & \hat{{\bf{R}}}_{\mathrm{ap}}^{-1}(k)\left [\bf{I} - \gamma \hat{{\bf{R}}}_{\mathrm{ ap}}^{-1}(k)\right ] \approx \hat{{\bf{R}}}_{\mathrm{ ap}}^{-1}(k) \end{array}$$
(4.108)

where the last two relations are valid for γ ≪ 1.

By assuming that the matrix \(\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)\) is statistically independent of the noiseless a priori error after convergence, and of the noise, (4.107) can be rewritten as

$$\begin{array}{rcl} & & 2\mathrm{tr}\left \{E[\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)]E[\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)]\right \}-\mu \mathrm{tr}\left \{E[\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)]E[\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)]\right \} \\ & & \quad + \gamma \mu \mathrm{tr}\left \{E[\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)]\right \}\!=\,\mu \mathrm{tr}\left \{E[{\mathbf{n}}_{\mathrm{ ap}}(k){\mathbf{n}}_{\mathrm{ap}}^{T}(k)]E[\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)]\right \}-\gamma \mu \mathrm{tr}\left \{E[{\mathbf{n}}_{\mathrm{ap}}(k){\mathbf{n}}_{\mathrm{ap}}^{T}(k)]\right \}\end{array}$$
(4.109)

This equation can be further simplified by assuming that the noise is white (footnote 4) and that γ is small, leading to the following expression

$$\begin{array}{rcl} (2 - \mu )\mathrm{tr}\{E[\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)]E[\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)]\} = \mu {\sigma }_{n}^{2}\mathrm{tr}\{E[\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)]\}& &\end{array}$$
(4.110)

Our task now is to compute \(E[\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)]\), where in the process we will assume that this matrix is diagonally dominant; the final result has the following form

$$\begin{array}{rcl} E[\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)] = \bf{A}E[\tilde{{e}}_{\mathrm{ ap},0}^{2}(k)] + {\mu }^{2}\bf{B}{\sigma }_{ n}^{2}& & \\ \end{array}$$

Proof.

The i-th rows of (4.92) and (4.93) are given by

$$\begin{array}{rcl} \tilde{{\epsilon }}_{\mathrm{ap},i}(k) = -{\bf{x}}^{T}(k - i)\Delta {\bf {w}}(k + 1)& &\end{array}$$
(4.111)

and

$$\begin{array}{rcl} \tilde{{e}}_{\mathrm{ap},i}(k) = -{\bf{x}}^{T}(k - i)\Delta {\bf {w}}(k) = {e}_{\mathrm{ ap},i}(k) - n(k - i)& &\end{array}$$
(4.112)

for i = 0, 1, …, L. Using in (4.91) the fact that X ap T(k)F x (k) ≈ I for small γ, it follows that

$$\begin{array}{rcl} -\tilde{{\mathbf{\epsilon }}}_{\mathrm{ap}}(k) = -\tilde{{\bf{e}}}_{\mathrm{ap}}(k) + \mu {\bf{e}}_{\mathrm{ap}}(k)& &\end{array}$$
(4.113)

By properly utilizing in (4.111) and (4.112) the i-th row of (4.91), we obtain

$$\begin{array}{rcl} \tilde{{\epsilon }}_{\mathrm{ap},i}(k)& =& -{\bf{x}}^{T}(k - i)\Delta {\bf {w}}(k + 1) \\ & =& (1 - \mu )\tilde{{e}}_{\mathrm{ap},i}(k) - \mu n(k - i) \\ & =& -(1 - \mu ){\bf{x}}^{T}(k - i)\Delta {\bf {w}}(k) - \mu n(k - i)\end{array}$$
(4.114)

Squaring the above equation, assuming that the coefficients are weakly dependent on the noise, which is in turn white, and closely following the procedure used to derive (4.96) from (4.95), we get

$$\begin{array}{rcl} E\left [{({\bf{x}}^{T}(k - i)\Delta {\bf {w}}(k + 1))}^{2}\right ] = {(1 - \mu )}^{2}E\left [{({\bf{x}}^{T}(k - i)\Delta {\bf {w}}(k))}^{2}\right ] + {\mu }^{2}{\sigma }_{ n}^{2}& &\end{array}$$
(4.115)

The above expression relates the squared values of the a posteriori and a priori errors. However, the same kind of relation holds for the previous time instant, that is

$$\begin{array}{rcl} E[{({\bf{x}}^{T}(k - i - 1)\Delta {\bf {w}}(k))}^{2}]& =& {(1 - \mu )}^{2}E[{({\bf{x}}^{T}(k - i - 1)\Delta {\bf {w}}(k - 1))}^{2}] + {\mu }^{2}{\sigma }_{ n}^{2}\\ \end{array}$$

or

$$\begin{array}{rcl} E[\tilde{{e}}_{\mathrm{ap},i+1}^{2}(k)]& =& {(1 - \mu )}^{2}E[\tilde{{e}}_{\mathrm{ ap},i}^{2}(k - 1)] + {\mu }^{2}{\sigma }_{ n}^{2}\end{array}$$
(4.116)

Note that for i = 0 this term corresponds to the second diagonal element of the matrix \(E[\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)]\). Specifically we can compute \(E[\tilde{{e}}_{\mathrm{ap},1}^{2}(k)]\) as

$$\begin{array}{rcl} E[{({\bf{x}}^{T}(k - 1)\Delta {\bf {w}}(k))}^{2}]& =& E[\tilde{{e}}_{\mathrm{ ap},1}^{2}(k)] \\ & =& {(1 - \mu )}^{2}E[{({\bf{x}}^{T}(k - 1)\Delta {\bf {w}}(k - 1))}^{2}] + {\mu }^{2}{\sigma }_{ n}^{2} \\ & =& {(1 - \mu )}^{2}E[\tilde{{e}}_{\mathrm{ ap},0}^{2}(k - 1)] + {\mu }^{2}{\sigma }_{ n}^{2} \end{array}$$
(4.117)

For i = 1 (4.116) becomes

$$\begin{array}{rcl} E[{({\bf{x}}^{T}(k - 2)\Delta {\bf {w}}(k))}^{2}]& =& E[\tilde{{e}}_{\mathrm{ ap},2}^{2}(k)] \\ & =& {(1 - \mu )}^{2}E[{({\bf{x}}^{T}(k - 2)\Delta {\bf {w}}(k - 1))}^{2}] + {\mu }^{2}{\sigma }_{ n}^{2} \\ & =& {(1 - \mu )}^{2}E[\tilde{{e}}_{\mathrm{ ap},1}^{2}(k - 1)] + {\mu }^{2}{\sigma }_{ n}^{2} \end{array}$$
(4.118)

By substituting (4.117) in the above equation it follows that

$$\begin{array}{rcl} E[\tilde{{e}}_{\mathrm{ap},2}^{2}(k)]& =& {(1 - \mu )}^{4}E[\tilde{{e}}_{\mathrm{ ap},0}^{2}(k - 2)] + [1 + {(1 - \mu )}^{2}]{\mu }^{2}{\sigma }_{ n}^{2}\end{array}$$
(4.119)

By induction one can prove that

$$\begin{array}{rcl} E[\tilde{{e}}_{\mathrm{ap},i+1}^{2}(k)]& =& {(1 - \mu )}^{2(i+1)}E[\tilde{{e}}_{\mathrm{ ap},0}^{2}(k - i - 1)] + \left [1 +{ \sum \nolimits }_{l=1}^{i}{(1 - \mu )}^{2l}\right ]{\mu }^{2}{\sigma }_{ n}^{2}\end{array}$$
(4.120)

By assuming that \(E[\tilde{{e}}_{\mathrm{ap},0}^{2}(k)] \approx E[\tilde{{e}}_{\mathrm{ap},0}^{2}(k - i)]\) for i = 0, 1, …, L, then

$$\begin{array}{rcl} E[\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)] = \bf{A}E[\tilde{{e}}_{\mathrm{ ap},0}^{2}(k)] + {\mu }^{2}\bf{B}{\sigma }_{ n}^{2}& &\end{array}$$
(4.121)

with

$$\begin{array}{rcl} \bf{A}& =& \mathrm{diag}\left [1,\;{(1 - \mu )}^{2},\;{(1 - \mu )}^{4},\;\ldots,\;{(1 - \mu )}^{2L}\right ] \\ \bf{B}& =& \mathrm{diag}\left [0,\;1,\;1 + {(1 - \mu )}^{2},\;\ldots,\;1 +{ \sum \nolimits }_{l=1}^{L-1}{(1 - \mu )}^{2l}\right ]\\ \end{array}$$

where it was also considered that the above matrix \(E[\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)]\) is diagonally dominant, as is usually the case in practice. Note from the above relation that the convergence factor μ should be chosen in the range 0 < μ < 2 so that the elements of the noiseless a priori error remain bounded for any value of L; in practice there is no point in using μ > 1. □ 
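
For reference, the diagonal matrices A and B of (4.121) can be built directly from their entries above; the small sketch below (function name and interface are assumptions made here) is only a transcription of those definitions.

```python
import numpy as np

def ap_error_model_matrices(mu, L):
    """Diagonal matrices A and B appearing in (4.121)."""
    r = (1.0 - mu) ** 2
    A = np.diag([r ** i for i in range(L + 1)])
    # B[0,0] = 0 and B[i,i] = 1 + sum_{l=1}^{i-1} (1-mu)^(2l) for i >= 1
    B = np.diag([0.0] + [sum(r ** l for l in range(i)) for i in range(1, L + 1)])
    return A, B
```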

We have available all the quantities required to calculate the excess MSE in the affine projection algorithm. Specifically, we can substitute the result of (4.121) in (4.110) obtaining

$$\begin{array}{rcl} (2 - \mu )\left [E[\tilde{{e}}_{\mathrm{ap},0}^{2}(k)]\mathrm{tr}\{\bf{A}E[\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)]\}+{\mu }^{2}{\sigma }_{ n}^{2}\mathrm{tr}\{\bf{B}E[\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)]\}\right ] = \mu {\sigma }_{n}^{2}\mathrm{tr}\left \{E[\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)]\right \}& &\end{array}$$
(4.122)

The second term on the left-hand side can be neglected when the signal-to-noise ratio is high. For small μ this term also becomes substantially smaller than the term on the right-hand side. For μ close to one the two terms become comparable only for large L, when the misadjustment becomes less sensitive to L. In the following discussion we will not consider the term multiplied by μ2.

Assuming the diagonal elements of \(E[\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)]\) are equal and the matrix A multiplying it on the left-hand side is a diagonal matrix, after a few manipulations it is possible to deduce that

$$\begin{array}{rcl} E[\tilde{{e}}_{\mathrm{ap},0}^{2}(k)]& =& \frac{\mu } {2 - \mu }{\sigma }_{n}^{2} \frac{\mathrm{tr}\{E[\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)]\}} {\mathrm{tr}\{\bf{A}E[\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)]\}} \\ & =& \frac{(L + 1)\mu } {2 - \mu } \frac{1 - {(1 - \mu )}^{2}} {1 - {(1 - \mu )}^{2(L+1)}}{\sigma }_{n}^{2}\end{array}$$
(4.123)

Therefore, the misadjustment for the affine projection algorithm is given by

$$\begin{array}{rcl} M = \frac{(L + 1)\mu } {2 - \mu } \frac{1 - {(1 - \mu )}^{2}} {1 - {(1 - \mu )}^{2(L+1)}}& &\end{array}$$
(4.124)

For large L and small 1 − μ, this equation can be approximated by

$$\begin{array}{rcl} M = \frac{(L + 1)\mu } {(2 - \mu )} & &\end{array}$$
(4.125)

In [23], by considering a simplified model for the input signal vector consisting of vectors with discrete angular orientation and the independence assumption, an expression for the misadjustment of the affine projection algorithm was derived, that is

$$\begin{array}{rcl} M = \frac{\mu } {2 - \mu }E\left [ \frac{1} {\|\bf{x}{(k)\|}^{2}}\right ]\mathrm{tr}[\bf{R}]& &\end{array}$$
(4.126)

which is independent of L. It is observed in experiments that a higher number of reuses leads to higher misadjustment, as indicated in (4.125). The equivalent expression to (4.126), using the derivations presented here, would be

$$\begin{array}{rcl} M = \frac{(L + 1)\mu } {2 - \mu } E\left [ \frac{1} {\|\bf{x}{(k)\|}^{2}}\right ]\mathrm{tr}[\bf{R}]& &\end{array}$$
(4.127)

which can be obtained from (4.123) by considering that

$$\begin{array}{rcl} \mathrm{tr}\{E[\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)]\} \approx (L + 1)E\left [ \frac{1} {\|\bf{x}{(k)\|}^{2}}\right ]& & \\ \end{array}$$

and

$$\begin{array}{rcl} \frac{1} {\mathrm{tr}\{\bf{A}E[\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)]\}} \approx \mathrm{tr}[\bf{R}]& & \\ \end{array}$$

for μ close to one.

4.6.2 Behavior in Nonstationary Environments

In a nonstationary environment the error in the coefficients is described by the following vector

$$\begin{array}{rcl} \Delta {\bf {w}}(k + 1) = {\bf {w}}(k + 1) -{{\bf {w}}}_{o}(k + 1)& &\end{array}$$
(4.128)

where w o (k + 1) is the optimal time-varying vector. For this case, (4.94) becomes

$$\begin{array}{rcl} \Delta {\bf {w}}(k + 1) = \Delta \hat{{\bf {w}}}(k) + \mu {\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k) + \gamma \bf{I}\right )}^{-1}{\bf{e}}_{\mathrm{ ap}}(k)& &\end{array}$$
(4.129)

where \(\Delta \hat{{\bf {w}}}(k) = {\bf {w}}(k) -{{\bf {w}}}_{o}(k + 1)\). By premultiplying the above expression by X ap T(k) it follows that

$$\begin{array}{rcl}{ \bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k + 1)& =&{ \bf{X}}_{\mathrm{ ap}}^{T}(k)\Delta \hat{{\bf {w}}}(k) + \mu {\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k) + \gamma \bf{I}\right )}^{-1}{\bf{e}}_{\mathrm{ ap}}(k) \\ -\tilde{{\mathbf{\epsilon }}}_{\mathrm{ap}}(k)& =& -\tilde{{\bf{e}}}_{\mathrm{ap}}(k) + \mu {\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k) + \gamma \bf{I}\right )}^{-1}{\bf{e}}_{\mathrm{ ap}}(k) \end{array}$$
(4.130)

By solving the (4.130), it is possible to show that

$$\begin{array}{rcl} \frac{1} {\mu }{\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\left [\tilde{{\bf{e}}}_{\mathrm{ ap}}(k) -\tilde{{\mathbf{\epsilon }}}_{\mathrm{ap}}(k)\right ] ={ \left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k) + \gamma \bf{I}\right )}^{-1}{\bf{e}}_{\mathrm{ ap}}(k)& &\end{array}$$
(4.131)

Following the same procedure used to derive (4.95), we can now substitute (4.131) in (4.129) in order to deduce that

$$\begin{array}{rcl} & & \Delta {\bf {w}}(k + 1) -{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k) \\ & & \quad = \Delta \hat{{\bf {w}}}(k) -{\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\end{array}$$
(4.132)

By computing the energy on both sides of this equation as previously performed in (4.96), it is possible to show that

$$\begin{array}{rcl} & & E\left [\|\Delta {\bf {w}}{(k + 1)\|}^{2}\right ] + E\left [\tilde{{\bf{e}}}_{\mathrm{ ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k)\right ] \\ & & \quad = E\left [\|\Delta \hat{{\bf {w}}}{(k)\|}^{2}\right ] + E\left [\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\right ] \\ & & \quad = E\left [\|\Delta {\bf {w}}(k) + \Delta {{\bf {w}}}_{o}{(k + 1)\|}^{2}\right ] + E\left [\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\right ] \\ & & \quad \approx E\left [\|\Delta {\bf {w}}{(k)\|}^{2}\right ] + E\left [\|\Delta {{\bf {w}}}_{ o}{(k + 1)\|}^{2}\right ] + E\left [\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\right ]\end{array}$$
(4.133)

where \(\Delta {{\bf {w}}}_{o}(k + 1) ={ {\bf {w}}}_{o}(k) -{{\bf {w}}}_{o}(k + 1)\), and in the last equality we have assumed that \(E\left [\Delta {{\bf {w}}}^{T}(k)\Delta {{\bf {w}}}_{o}(k + 1)\right ] \approx 0\). This assumption is valid for simple models of the time-varying behavior of the unknown system, such as the random walk model [30] (footnote 5). We will adopt this assumption in order to simplify our analysis.

The time-varying characteristic of the unknown system leads to an excess mean-square error. As before, in order to calculate the excess MSE we assume that each element of the optimal coefficient vector is modeled as a first-order Markov process. As previously mentioned, this nonstationary environment can be considered somewhat simplified, but allows a manageable mathematical analysis. The first-order Markov process is described by

$$\begin{array}{rcl}{ {\bf {w}}}_{o}(k) = {\lambda }_{{\bf {w}}}{{\bf {w}}}_{o}(k - 1) + {\kappa }_{{\bf {w}}}{\mathbf{n}}_{{\bf {w}}}(k)& &\end{array}$$
(4.134)

where n w (k) is a vector whose elements are zero-mean white noise processes with variance σ w 2, and λ w  < 1. If κ w  = 1, this model may not represent a real system when λ w  → 1, since E[w o (k)w o T(k)] will have unbounded elements if, for example, n w (k) is not exactly zero mean. A better model utilizes a factor \({\kappa }_{{\bf {w}}} = {(1 - {\lambda }_{{\bf {w}}})}^{{ p \over 2} }\), for p ≥ 1, multiplying n w (k) in order to guarantee that E[w o (k)w o T(k)] is bounded.

In our derivations of the excess MSE, the covariance of \(\Delta {{\bf {w}}}_{o}(k + 1) ={ {\bf {w}}}_{o}(k) -{{\bf {w}}}_{o}(k + 1)\) is required. That is

$$\begin{array}{rcl} \mathrm{cov}[\Delta {{\bf {w}}}_{o}(k + 1)]& =& E\left [({{\bf {w}}}_{o}(k + 1) -{{\bf {w}}}_{o}(k)){({{\bf {w}}}_{o}(k + 1) -{{\bf {w}}}_{o}(k))}^{T}\right ] \\ & =& E\left [({\lambda }_{{\bf {w}}}{{\bf {w}}}_{o}(k) + {\kappa }_{{\bf {w}}}{\mathbf{n}}_{{\bf {w}}}(k) -{{\bf {w}}}_{o}(k)){({\lambda }_{{\bf {w}}}{{\bf {w}}}_{o}(k) + {\kappa }_{{\bf {w}}}{\mathbf{n}}_{{\bf {w}}}(k) -{{\bf {w}}}_{o}(k))}^{T}\right ] \\ & =& E\left \{[({\lambda }_{{\bf {w}}} - 1){{\bf {w}}}_{o}(k) + {\kappa }_{{\bf {w}}}{\mathbf{n}}_{{\bf {w}}}(k)]{[({\lambda }_{{\bf {w}}} - 1){{\bf {w}}}_{o}(k) + {\kappa }_{{\bf {w}}}{\mathbf{n}}_{{\bf {w}}}(k)]}^{T}\right \} \end{array}$$
(4.135)

Since each element of n w (k) is a zero-mean white noise process with variance σ w 2, and λ w  < 1, by applying the result of (2.82), it follows that

$$\begin{array}{rcl} \mathrm{cov}[\Delta {{\bf {w}}}_{o}(k + 1)]& =& {\kappa }_{{\bf {w}}}^{2}{\sigma }_{{\bf {w}}}^{2}\frac{{(1 - {\lambda }_{{\bf {w}}})}^{2}} {1 - {\lambda }_{{\bf {w}}}^{2}} \bf{I} + {\kappa }_{{\bf {w}}}^{2}{\sigma }_{{\bf {w}}}^{2}\bf{I} \\ & =& {\kappa }_{{\bf {w}}}^{2}\left [\frac{1 - {\lambda }_{{\bf {w}}}} {1 + {\lambda }_{{\bf {w}}}} + 1\right ]{\sigma }_{{\bf {w}}}^{2}\bf{I} \end{array}$$
(4.136)

By employing this result, we can compute

$$\begin{array}{rcl} E\left [\|\Delta {{\bf {w}}}_{o}{(k + 1)\|}^{2}\right ] = \mathrm{tr}\{\mathrm{cov}[\Delta {{\bf {w}}}_{ o}(k + 1)]\} = (N + 1)\left [ \frac{2{\kappa }_{{\bf {w}}}^{2}} {1 + {\lambda }_{{\bf {w}}}}\right ]{\sigma }_{{\bf {w}}}^{2}\qquad & &\end{array}$$
(4.137)

We are now in a position to solve (4.133) utilizing the result of (4.137). Again by assuming that the algorithm has converged, that is, that the Euclidean norm of the coefficient-error vector remains on average unchanged, then \(E\left [\|\Delta {\bf {w}}{(k + 1)\|}^{2}\right ] = E\left [\|\Delta {\bf {w}}{(k)\|}^{2}\right ]\). As a result, (4.133) can be rewritten as

$$\begin{array}{rcl} E\left [\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\bf{e}}}_{\mathrm{ ap}}(k)\right ]& =& E\left [\tilde{{\mathbf{\epsilon }}}_{\mathrm{ap}}^{T}(k){\left ({\bf{X}}_{\mathrm{ ap}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}\tilde{{\mathbf{\epsilon }}}_{\mathrm{ ap}}(k)\right ] \\ & & +(N + 1)\left [ \frac{2{\kappa }_{{\bf {w}}}^{2}} {1 + {\lambda }_{{\bf {w}}}}\right ]{\sigma }_{{\bf {w}}}^{2} \end{array}$$
(4.138)

This leads to the equivalent of (4.101), as follows

$$\begin{array}{rcl}{ \mu }^{2}E\left [{\bf{e}}_{\mathrm{ ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{e}}_{\mathrm{ap}}(k)\right ]& =& \mu E\left [\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k){\bf{e}}_{\mathrm{ap}}(k)\right. \\ & & \qquad \quad \left.+{\bf{e}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\right ] \\ & & +(N + 1)\left [ \frac{2{\kappa }_{{\bf {w}}}^{2}} {1 + {\lambda }_{{\bf {w}}}}\right ]{\sigma }_{{\bf {w}}}^{2}\qquad \qquad \end{array}$$
(4.139)

By solving this equation following precisely the same procedure used to solve (4.101), we can derive the excess MSE due only to the time-varying unknown system.

$$\begin{array}{rcl}{ \xi }_{\mathrm{lag}} = \frac{N + 1} {\mu (2 - \mu )}\left [ \frac{2{\kappa }_{{\bf {w}}}^{2}} {1 + {\lambda }_{{\bf {w}}}}\right ]{\sigma }_{{\bf {w}}}^{2}& &\end{array}$$
(4.140)

By taking into consideration the additional noise and the time-varying parameters to be estimated, the overall excess MSE is given by

$$\begin{array}{rcl}{ \xi }_{\mathrm{exc}}& =& \frac{(L + 1)\mu } {2 - \mu } \frac{1 - {(1 - \mu )}^{2}} {1 - {(1 - \mu )}^{2(L+1)}}{\sigma }_{n}^{2} + \frac{N + 1} {\mu (2 - \mu )}\left [ \frac{2{\kappa }_{{\bf {w}}}^{2}} {1 + {\lambda }_{{\bf {w}}}}\right ]{\sigma }_{{\bf {w}}}^{2} \\ & =& \frac{1} {2 - \mu }\left \{(L + 1)\mu \frac{1 - {(1 - \mu )}^{2}} {1 - {(1 - \mu )}^{2(L+1)}}{\sigma }_{n}^{2} + \frac{N + 1} {\mu } \left [ \frac{2{\kappa }_{{\bf {w}}}^{2}} {1 + {\lambda }_{{\bf {w}}}}\right ]{\sigma }_{{\bf {w}}}^{2}\right \}\end{array}$$
(4.141)

If κ w  = 1, L is large, and | 1 − μ |  < 1, the above expression simplifies to

$$\begin{array}{rcl}{ \xi }_{\mathrm{exc}} = \frac{1} {2 - \mu }\left \{(L + 1)\mu {\sigma }_{n}^{2} + \frac{2(N + 1)} {\mu (1 + {\lambda }_{{\bf {w}}})}{\sigma }_{{\bf {w}}}^{2}\right \}& &\end{array}$$
(4.142)

As can be observed, the contribution due to the lag is inversely proportional to the value of μ. This is an expected result since for small values of μ an adaptive-filtering algorithm will face difficulties in tracking the variations in the unknown system.
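
To make this trade-off concrete, the simplified expression (4.142) can be evaluated for a few step sizes; the parameter values in the sketch below are arbitrary examples.

```python
def excess_mse(mu, L, N, sigma_n2, sigma_w2, lam_w):
    """Simplified overall excess MSE of (4.142)."""
    noise_term = (L + 1) * mu * sigma_n2                    # grows with mu
    lag_term = 2 * (N + 1) * sigma_w2 / (mu * (1 + lam_w))  # shrinks with mu
    return (noise_term + lag_term) / (2 - mu)

for mu in (0.1, 0.5, 1.0):
    print(mu, excess_mse(mu, L=1, N=9, sigma_n2=1e-3, sigma_w2=1e-6, lam_w=0.99))
```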

4.6.3 Transient Behavior

This subsection presents some considerations related to the behavior of the affine projection algorithm during the transient. In order to achieve this goal we start by removing the dependence of (4.96) on the noiseless a posteriori error through (4.99), much as previously done in the derivations of (4.100) and (4.101). The resulting expression is

$$\begin{array}{rcl} E\left [\|\Delta {\bf {w}}{(k + 1)\|}^{2}\right ]& =& E\left [\|\Delta {\bf {w}}{(k)\|}^{2}\right ] + {\mu }^{2}E\left [{\bf{e}}_{\mathrm{ ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{e}}_{\mathrm{ap}}(k)\right ] \\ & & -\mu E\left [\tilde{{\bf{e}}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k){\bf{e}}_{\mathrm{ap}}(k) +{ \bf{e}}_{\mathrm{ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\tilde{{\bf{e}}}_{\mathrm{ap}}(k)\right ]\end{array}$$
(4.143)

Since from (4.93)

$$\begin{array}{rcl}{ \bf{e}}_{\mathrm{ap}}(k)& =& \tilde{{\bf{e}}}_{\mathrm{ap}}(k) +{ \mathbf{n}}_{\mathrm{ap}}(k) \\ & =& -{\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k) +{ \mathbf{n}}_{\mathrm{ ap}}(k) \\ \end{array}$$

the above expression (4.143) can be rewritten as

$$\begin{array}{rcl} & & E\left [\|\Delta {\bf {w}}{(k + 1)\|}^{2}\right ] = E\left [\|\Delta {\bf {w}}{(k)\|}^{2}\right ] \\ & & \quad + {\mu }^{2}E\left [\left (-\Delta {{\bf {w}}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)+{\mathbf{n}}_{\mathrm{ap}}^{T}(k)\right )\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)\left (-{\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k) +{ \mathbf{n}}_{\mathrm{ ap}}(k)\right )\right ] \\ & & \quad - \mu E\left [\left (-\Delta {{\bf {w}}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)\left (-{\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k) +{ \mathbf{n}}_{\mathrm{ ap}}(k)\right )\right. \\ & & \qquad \qquad \left.+\left (-\Delta {{\bf {w}}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k) +{ \mathbf{n}}_{\mathrm{ap}}^{T}(k)\right )\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\left (-{\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k)\right )\right ] \end{array}$$
(4.144)

By considering the noise white and uncorrelated with the other quantities of this recursion, the above equation can be simplified to

$$\begin{array}{rcl} E\left [\|\Delta {\bf {w}}{(k + 1)\|}^{2}\right ]& =& E\left [\|\Delta {\bf {w}}{(k)\|}^{2}\right ] - 2\mu E\left [\Delta {{\bf {w}}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k)\right ] \\ & & +{\mu }^{2}E\left [\Delta {{\bf {w}}}^{T}(k){\bf{X}}_{\mathrm{ ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k)\right ] \\ & & +{\mu }^{2}E\left [{\mathbf{n}}_{\mathrm{ ap}}^{T}(k)\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\mathbf{n}}_{\mathrm{ap}}(k)\right ] \end{array}$$
(4.145)

By applying the property that tr[AB] = tr[BA], this relation is equivalent to

$$\begin{array}{rcl} \mathrm{tr}\{\mathrm{cov}[\Delta {\bf {w}}(k + 1)]\}& =& \mathrm{tr}\left [\mathrm{cov}[\Delta {\bf {w}}(k)]\right ] - 2\mu \mathrm{tr}\left \{E\left [{\bf{X}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k)\Delta {{\bf {w}}}^{T}(k)\right ]\right \} \\ & & +{\mu }^{2}\mathrm{tr}\left \{E\left [{\bf{X}}_{\mathrm{ ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\Delta {\bf {w}}(k)\Delta {{\bf {w}}}^{T}(k)\right ]\right \} \\ & & +{\mu }^{2}\mathrm{tr}\left \{E\left [\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)\right ]E\left [{\mathbf{n}}_{\mathrm{ap}}(k){\mathbf{n}}_{\mathrm{ap}}^{T}(k)\right ]\right \} \end{array}$$
(4.146)

By assuming that Δw(k) is independent of the data and that the noise is white, it follows that

$$\begin{array}{rcl} \mathrm{tr}\{\mathrm{cov}[\Delta {\bf {w}}(k + 1)]\}& =& \mathrm{tr}\left \{\left [\bf{I} - E\left (2\mu {\bf{X}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\right.\right.\right. \\ & & \qquad -\left.\left.\left.{\mu }^{2}{\bf{X}}_{\mathrm{ ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\right )\right ]\mathrm{cov}[\Delta {\bf {w}}(k)]\right \} \\ & & +{\mu }^{2}{\sigma }_{ n}^{2}\mathrm{tr}\left \{E\left [\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\hat{{\bf{R}}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)\right ]\right \} \end{array}$$
(4.147)

Now by recalling that

$$\begin{array}{rcl} \hat{{\mathbf{S}}}_{\mathrm{ap}}(k)& \approx & \hat{{\bf{R}}}_{\mathrm{ap}}^{-1}(k)\left [\bf{I} - \gamma \hat{{\bf{R}}}_{\mathrm{ ap}}^{-1}(k)\right ] \\ \end{array}$$

and by utilizing the unitary matrix Q, which in the present discussion diagonalizes \(E[{\bf{X}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)]\), the following relation is valid

$$\begin{array}{rcl} \mathrm{tr}\left \{\mathrm{cov}[\Delta {\bf {w}}(k + 1)]\mathbf{Q}{\mathbf{Q}}^{T}\right \}& =& \mathrm{tr}\left \{\mathbf{Q}{\mathbf{Q}}^{T}\left [\bf{I} - E\left (2\mu {\bf{X}}_{\mathrm{ ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\right.\right.\right. \\ & & -\left.\left.\left.(1 - \gamma ){\mu }^{2}{\bf{X}}_{\mathrm{ ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\right )\right ]\mathbf{Q}{\mathbf{Q}}^{T}\mathrm{cov}[\Delta {\bf {w}}(k)]\mathbf{Q}{\mathbf{Q}}^{T}\right \} \\ & & +(1 - \gamma ){\mu }^{2}{\sigma }_{ n}^{2}\mathrm{tr}\left \{E\left [\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\right ]\right \} \end{array}$$
(4.148)

Again by applying the property that tr[AB] = tr[BA] and assuming γ small, it follows that

$$\begin{array}{rcl} \mathrm{tr}\left \{{\mathbf{Q}}^{T}\mathrm{cov}[\Delta {\bf {w}}(k + 1)]\mathbf{Q}\right \}& =& \mathrm{tr}\left \{\mathbf{Q}\left [\bf{I} -{\mathbf{Q}}^{T}E\left (2\mu {\bf{X}}_{\mathrm{ ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\right.\right.\right. \\ & & -\left.\left.\left.{\mu }^{2}{\bf{X}}_{\mathrm{ ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\right )\mathbf{Q}\right ]{\mathbf{Q}}^{T}\mathrm{cov}[\Delta {\bf {w}}(k)]\mathbf{Q}{\mathbf{Q}}^{T}\right \} \\ & & +{\mu }^{2}{\sigma }_{ n}^{2}\mathrm{tr}\left \{E\left [\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\right ]\right \} \end{array}$$
(4.149)

By defining

$$\begin{array}{rcl} \Delta {{\bf {w}}}^{{\prime}}(k + 1) ={ \mathbf{Q}}^{T}\Delta {\bf {w}}(k + 1)& & \\ \end{array}$$

Equation (4.149) can be rewritten as

$$\begin{array}{rcl} \mathrm{tr}\{\mathrm{cov}[\Delta {{\bf {w}}}^{{\prime}}(k + 1)]\}& =& \mathrm{tr}\left \{{\mathbf{Q}}^{T}\mathbf{Q}\left [\bf{I} -{\mathbf{Q}}^{T}E\left (2\mu {\bf{X}}_{\mathrm{ ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\right.\right.\right. \\ & & \qquad -\left.\left.\left.{\mu }^{2}{\bf{X}}_{\mathrm{ ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)\right )\mathbf{Q}\right ]\mathrm{cov}[\Delta {{\bf {w}}}^{{\prime}}(k)]\right \} \\ & & +{\mu }^{2}{\sigma }_{ n}^{2}\mathrm{tr}\left \{E\left [\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\right ]\right \} \\ & =& \mathrm{tr}\left \{\left [\bf{I} - 2\mu \hat{{\Lambda }} + {\mu }^{2}\hat{{\Lambda }}\right ]\mathrm{cov}[\Delta {{\bf {w}}}^{{\prime}}(k)]\right \} + {\mu }^{2}{\sigma }_{ n}^{2}\mathrm{tr}\left \{E\left [\hat{{\mathbf{S}}}_{\mathrm{ ap}}(k)\right ]\right \}\end{array}$$
(4.150)

where \(\hat{{\Lambda }}\) is a diagonal matrix whose elements are the eigenvalues of \(E[{\bf{X}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)]\), denoted as \(\hat{{\lambda }}_{i}\), for i = 0, 1, …, N.

By using the likely assumption that cov[Δw (k + 1)] and \(\hat{{\mathbf{S}}}_{\mathrm{ap}}(k)\) are diagonally dominant, we can disregard the trace operator in the above equation and observe that the geometrically decaying curves have ratios \({r}_{\mathrm{cov}[\Delta {\bf {w}}(k)]} = (1 - 2\mu \hat{{\lambda }}_{i} + {\mu }^{2}\hat{{\lambda }}_{i})\). As a result, according to the considerations in the derivation of (3.52), it is possible to infer that the convergence time constant is given by

$$\begin{array}{rcl}{ \tau }_{ei}& =& {\tau }_{\mathrm{cov}[\Delta {\bf {w}}(k)]} \\ & =& \frac{1} {\mu \hat{{\lambda }}_{i}} \frac{1} {2 - \mu }\end{array}$$
(4.151)

since the error squared depends on the convergence of the diagonal elements of the covariance matrix of the coefficient-error vector, see discussions around (3.53). As can be observed, the time constants for error convergence are dependent on the inverse of the eigenvalues of \(E[{\bf{X}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)]\). However, since μ is not constrained by these eigenvalues, the speed of convergence is expected to be higher than for the LMS algorithm, particularly in situations where the eigenvalue spread of the input signal is high. Simulation results confirm the improved performance of the affine projection algorithm.
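
One way to visualize these time constants is to estimate \(E[{\bf{X}}_{\mathrm{ap}}(k)\hat{{\mathbf{S}}}_{\mathrm{ap}}(k){\bf{X}}_{\mathrm{ap}}^{T}(k)]\) by averaging over input realizations and then to apply (4.151). The sketch below does so for a white Gaussian input, which is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, L, mu, gamma, trials = 7, 2, 0.5, 1e-6, 5000
P = np.zeros((N + 1, N + 1))
for _ in range(trials):
    X_ap = rng.standard_normal((N + 1, L + 1))
    S_hat = np.linalg.inv(X_ap.T @ X_ap + gamma * np.eye(L + 1))
    P += X_ap @ S_hat @ X_ap.T
P /= trials                                      # estimate of E[X_ap S_hat X_ap^T]

lam_hat = np.linalg.eigvalsh(P)                  # eigenvalues lambda_hat_i
tau = 1.0 / (mu * lam_hat * (2 - mu))            # time constants of (4.151)
print(tau)
```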

4.6.4 Complex Affine Projection Algorithm

Using the method of Lagrange multipliers to transform the constrained minimization into an unconstrained one, the unconstrained function to be minimized is

$$\begin{array}{rcl} F[{\bf {w}}(k + 1)] = \frac{1} {2}\|{\bf {w}}(k + 1) -{\bf {w}}{(k)\|}^{2} + \mathrm{re}\left \{{{\lambda }}_{\mathrm{ ap}}^{T}(k)[{\bf{d}}_{\mathrm{ ap}}(k)-{\bf{X}}_{\mathrm{ap}}^{T}(k){{\bf {w}}}^{{_\ast}}(k+1)]\right \}& &\end{array}$$
(4.152)

where λ ap(k) is a complex (L + 1) ×1 vector of Lagrange multipliers, and the real part operator is required in order to turn the overall objective function real valued. The above expression can be rewritten as

$$\begin{array}{rcl} F[{\bf {w}}(k + 1)]& =& \frac{1} {2}{[{\bf {w}}(k + 1) -{\bf {w}}(k)]}^{H}\left [{\bf {w}}(k + 1) -{\bf {w}}(k)\right ] \\ & & +\frac{1} {2}{{\lambda }}_{\mathrm{ap}}^{H}(k)\left [{\bf{d}}_{\mathrm{ ap}}^{{_\ast}}(k) -{\bf{X}}_{\mathrm{ ap}}^{H}(k){\bf {w}}(k + 1)\right ] \\ & & +\frac{1} {2}{{\lambda }}_{\mathrm{ap}}^{T}(k)\left [{\bf{d}}_{\mathrm{ ap}}(k) -{\bf{X}}_{\mathrm{ap}}^{T}(k){{\bf {w}}}^{{_\ast}}(k + 1)\right ]\end{array}$$
(4.153)

The gradient of F[w(k + 1)] with respect to w  ∗ (k + 1) is given by (footnote 6)

$$\begin{array}{rcl} \frac{\partial F[{\bf {w}}(k + 1)]} {\partial {{\bf {w}}}^{{_\ast}}(k + 1)} ={ \bf{g}}_{{{\bf {w}}}^{{_\ast}}}\{F[{\bf {w}}(k + 1)]\} = \frac{1} {2}\left [{\bf {w}}(k + 1) -{\bf {w}}(k)\right ] -\frac{1} {2}{\bf{X}}_{\mathrm{ap}}(k){{\lambda }}_{\mathrm{ap}}(k)& &\end{array}$$
(4.154)

After setting the gradient of F[w(k + 1)] with respect to w  ∗ (k + 1) equal to zero, the expression below follows

$$\begin{array}{rcl} {\bf {w}}(k + 1) = {\bf {w}}(k) +{ \bf{X}}_{\mathrm{ap}}(k){{\lambda }}_{\mathrm{ap}}(k)& &\end{array}$$
(4.155)

By replacing (4.155) in the constraint relation \({\bf{d}}_{\mathrm{ap}}^{{_\ast}}(k) -{\bf{X}}_{\mathrm{ap}}^{H}(k){\bf {w}}(k + 1) = \bf{0}\), we generate the expression

$$\begin{array}{rcl}{ \bf{X}}_{\mathrm{ap}}^{H}(k){\bf{X}}_{\mathrm{ ap}}(k){{\lambda }}_{\mathrm{ap}}(k) ={ \bf{d}}_{\mathrm{ap}}^{{_\ast}}(k) -{\bf{X}}_{\mathrm{ ap}}^{H}(k){\bf {w}}(k) ={ \bf{e}}_{\mathrm{ ap}}^{{_\ast}}(k)& &\end{array}$$
(4.156)

The update equation is now given by (4.155) with λ ap(k) being the solution of (4.156), i.e.,

$$\begin{array}{rcl} {\bf {w}}(k + 1) = {\bf {w}}(k) +{ \bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{H}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}{\bf{e}}_{\mathrm{ ap}}^{{_\ast}}(k)& &\end{array}$$
(4.157)

This updating equation corresponds to the complex affine projection algorithm with unit convergence factor. As is common practice, we introduce a convergence factor in order to trade off final misadjustment and convergence speed, as follows

$$\begin{array}{rcl} {\bf {w}}(k + 1) = {\bf {w}}(k) + \mu {\bf{X}}_{\mathrm{ap}}(k){\left ({\bf{X}}_{\mathrm{ap}}^{H}(k){\bf{X}}_{\mathrm{ ap}}(k)\right )}^{-1}{\bf{e}}_{\mathrm{ ap}}^{{_\ast}}(k)& &\end{array}$$
(4.158)

The description of the complex affine projection algorithm is given in Algorithm 4.6, where, as before, regularization is introduced by adding an identity matrix multiplied by a small constant to the matrix X ap H(k)X ap(k), in order to avoid numerical problems in the matrix inversion.
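
A compact sketch of one update of the complex affine projection algorithm, with the regularization just mentioned, is given below; the function name and default values are illustrative only.

```python
import numpy as np

def complex_ap_update(w, X_ap, d_ap, mu=1.0, gamma=1e-6):
    """X_ap: (N+1) x (L+1) complex; d_ap: (L+1,) complex; returns w(k+1)."""
    e_ap_conj = np.conj(d_ap) - X_ap.conj().T @ w           # e_ap*(k), as in (4.156)
    R_reg = X_ap.conj().T @ X_ap + gamma * np.eye(X_ap.shape[1])
    return w + mu * X_ap @ np.linalg.solve(R_reg, e_ap_conj)
```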

4.7 Examples

This section includes a number of examples in order to assess the performance of the LMS-based algorithms described in this chapter.

4.7.1 Analytical Examples

Example 4.3 (Stochastic Gradient Algorithm). 

Derive the update equation for a stochastic gradient algorithm designed to minimize the following objective function.

$$\begin{array}{rcl} E\left [F[{\bf {w}}(k)]\right ] = E\left [a\vert d(k) -{{\bf {w}}}_{1}^{H}(k)\bf{x}(k){\vert }^{4} + b\vert d(k) -{{\bf {w}}}_{ 2}^{T}(k)\bf{x}(k){\vert }^{4}\right ]& & \\ \end{array}$$

where

$$\begin{array}{rcl} {\bf {w}}(k) = \left [\begin{array}{c} {{\bf {w}}}_{1}(k) \\ {{\bf {w}}}_{2}(k) \end{array} \right ]& & \\ \end{array}$$

and w 2(k) is a vector with real-valued entries. The parameters a and b are also real.

Solution.

The given objective function can be rewritten as

$$\begin{array}{rcl} F[{\bf {w}}(k)]& =& a\left \{{(d(k) -{{\bf {w}}}_{1}^{H}(k)\bf{x}(k))}^{2}{({d}^{{_\ast}}(k) -{{\bf {w}}}_{ 1}^{T}(k){\bf{x}}^{{_\ast}}(k))}^{2}\right \} \\ & & +b\left \{{(d(k) -{{\bf {w}}}_{2}^{T}(k)\bf{x}(k))}^{2}{({d}^{{_\ast}}(k) -{{\bf {w}}}_{ 2}^{T}(k){\bf{x}}^{{_\ast}}(k))}^{2}\right \} \\ \end{array}$$

where by denoting \({e}_{1}(k) = d(k) -{{\bf {w}}}_{1}^{H}(k)\bf{x}(k)\) and \({e}_{2}(k) = d(k) -{{\bf {w}}}_{2}^{T}(k)\bf{x}(k)\), it is possible to compute the gradient expression as

$$\begin{array}{rcl}{ \bf{g}}_{{{\bf {w}}}^{{_\ast}}}\{F[{\bf {w}}(k)]\}& =& \left [\begin{array}{c} - 2a{e}_{1}^{{_\ast}}(k)\bf{x}(k)\vert {e}_{1}(k){\vert }^{2} \\ - 2b{e}_{2}^{{_\ast}}(k)\bf{x}(k)\vert {e}_{2}(k){\vert }^{2} - 2b{e}_{2}(k){\bf{x}}^{{_\ast}}(k)\vert {e}_{2}(k){\vert }^{2} \end{array} \right ]\\ \end{array}$$

The updating equation is then given by

$$\begin{array}{rcl} {\bf {w}}(k + 1)& =& {\bf {w}}(k) - \mu \left [\begin{array}{c} - 2a{e}_{1}^{{_\ast}}(k)\bf{x}(k)\vert {e}_{1}(k){\vert }^{2} \\ - 4b\:\:\mathrm{re}\left [{e}_{2}^{{_\ast}}(k)\bf{x}(k)\right ]\vert {e}_{2}(k){\vert }^{2} \end{array} \right ] \\ & =& {\bf {w}}(k) + \mu \left [\begin{array}{c} 2a{e}_{1}^{{_\ast}}(k)\bf{x}(k)\vert {e}_{1}(k){\vert }^{2} \\ 4b\:\:\mathrm{re}\left [{e}_{2}^{{_\ast}}(k)\bf{x}(k)\right ]\vert {e}_{2}(k){\vert }^{2} \end{array} \right ] \square \\ \end{array}$$
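
A direct transcription of this updating equation into NumPy might look as follows; the partition of the coefficient vector into w1 (complex) and w2 (real) follows the problem statement, while the function name is an assumption made here.

```python
import numpy as np

def example_4_3_update(w1, w2, x, d, mu, a, b):
    """One stochastic-gradient update for the objective of Example 4.3."""
    e1 = d - np.vdot(w1, x)                              # e1(k) = d(k) - w1^H(k)x(k)
    e2 = d - w2 @ x                                      # e2(k) = d(k) - w2^T(k)x(k)
    w1 = w1 + 2.0 * mu * a * np.conj(e1) * x * abs(e1) ** 2
    w2 = w2 + 4.0 * mu * b * np.real(np.conj(e2) * x) * abs(e2) ** 2
    return w1, w2
```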

Example 4.4 (Normalized LMS Algorithm). 

  1. (a)

    A normalized LMS algorithm using convergence factor equal to one has the following data available

    $$\begin{array}{rcl} \bf{x}(0)& =& \left [\begin{array}{c} 2 + {\epsilon }_{1}\\ 2\end{array} \right ] \\ d(0)& =& 1 \end{array}$$

    and

    $$\begin{array}{rcl} \bf{x}(1)& =& \left [\begin{array}{c} 1\\ 1 + {\epsilon }_{2}\end{array} \right ] \\ d(1)& =& 0 \end{array}$$

    where the initial values for the coefficients are zero and ε1 and ε2 are real-valued constants. Determine the hyperplanes

    $$\begin{array}{rcl} \mathcal{S}(k) =\{ {\bf {w}}(k + 1) \in {\mathbb{R}}^{2} : d(k) -{{\bf {w}}}^{T}(k + 1)\bf{x}(k) = 0\}& & \\ \end{array}$$

    for two updates.

  2. (b)

    If the given data belong to an identification problem without additional noise, what would be the coefficients of the unknown system?

  3. (c)

    What would be the solution if \({\epsilon }_{1} = {\epsilon }_{2} = 0\)?

Solution.

  1. (a)

    The hyperplanes defined by the given data vectors are respectively given by

    $$\begin{array}{rcl} \mathcal{S}(0) =\{ {\bf {w}}(1) \in {\mathbb{R}}^{2} : 1 - (2 + {\epsilon }_{ 1}){w}_{0}(1) - 2{w}_{1}(1) = 0\}& & \\ \end{array}$$

    and

    $$\begin{array}{rcl} \mathcal{S}(1) =\{ {\bf {w}}(2) \in {\mathbb{R}}^{2} : 0 - {w}_{ 0}(2) - (1 + {\epsilon }_{2}){w}_{1}(2) = 0\}& & \\ \end{array}$$
  2. (b)

    The solution lies on \(\mathcal{S}(0) \cap \mathcal{S}(1)\). Thus

    $$\begin{array}{rcl} (2 + {\epsilon }_{1}){w}_{0} + 2{w}_{1}& =& 1 \\ {w}_{0} + (1 + {\epsilon }_{2}){w}_{1}& =& 0 \end{array}$$

    whose solution is

    $$\begin{array}{rcl}{ {\bf {w}}}_{o}& =& \left [\begin{array}{c} \frac{1+{\epsilon }_{2}} {{\epsilon }_{1}+{\epsilon }_{1}{\epsilon }_{2}+2{\epsilon }_{2}} \\ \frac{-1} {{\epsilon }_{1}+{\epsilon }_{1}{\epsilon }_{2}+2{\epsilon }_{2}}\end{array} \right ]\\ \end{array}$$

    assuming \({\epsilon }_{1} + {\epsilon }_{1}{\epsilon }_{2} + 2{\epsilon }_{2}\neq 0\), i.e., that the denominator above is nonzero.

  3. (c)

    For \({\epsilon }_{1} = {\epsilon }_{2} = 0\) the hyperplanes \(\mathcal{S}(0)\) and \(\mathcal{S}(1)\) are parallel and the previous solution is no longer valid. In this case there is no solution. □ 
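
For concreteness, the sketch below runs the two unit-step normalized LMS updates of part (a) and evaluates the closed-form intersection of part (b) for the arbitrary choice ε1 = ε2 = 0.5 (values assumed here only for illustration).

```python
import numpy as np

eps1, eps2 = 0.5, 0.5                               # arbitrary nonzero choices
x0, d0 = np.array([2 + eps1, 2.0]), 1.0
x1, d1 = np.array([1.0, 1 + eps2]), 0.0

w = np.zeros(2)
w = w + (d0 - w @ x0) * x0 / (x0 @ x0)              # first update: w(1) lies on S(0)
w = w + (d1 - w @ x1) * x1 / (x1 @ x1)              # second update: w(2) lies on S(1)
print(w)                                            # on S(1), generally not on S(0)

den = eps1 + eps1 * eps2 + 2 * eps2
w_o = np.array([(1 + eps2) / den, -1.0 / den])      # intersection of S(0) and S(1)
print(d0 - w_o @ x0, d1 - w_o @ x1)                 # both residuals are zero
```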

Example 4.5 (Complex Normalized LMS Algorithm). 

Which objective function is actually minimized by the complex normalized LMS algorithm with regularization factor γ and convergence factor μ n ?

$$\begin{array}{rcl} {\bf {w}}(k + 1) = {\bf {w}}(k) + \frac{{\mu }_{n}} {\gamma +{ \bf{x}}^{H}(k)\bf{x}(k)}\bf{x}(k){\mathrm{e}}^{{_\ast}}(k)& &\end{array}$$
(4.159)

Assume that γ is included for regularization purposes.

Solution.

Our main task is to search for an objective function whose stochastic gradient corresponds to the last term of the above equation. Define

$$\begin{array}{rcl} \alpha & =& \left ( \frac{1} {{\mu }_{n}} - 1 + {\alpha }_{p}\gamma \right )\end{array}$$
(4.160)

The objective function to be minimized with respect to the coefficients w  ∗ (k + 1) is given by

$$\begin{array}{rcl} \xi (k) = \alpha \|{\bf {w}}(k + 1) -{\bf {w}}{(k)\|}^{2} + {\alpha }_{ p}\|d(k) -{\bf{x}}^{T}(k){{\bf {w}}}^{{_\ast}}{(k + 1)\|}^{2}\quad & &\end{array}$$
(4.161)

where

$$\begin{array}{rcl}{ \alpha }_{p} = \frac{1} {\gamma +{ \bf{x}}^{H}(k)\bf{x}(k)}& &\end{array}$$
(4.162)

This result can be verified by computing the derivative of the objective function with respect to w  ∗ (k + 1), as described next.

$$\begin{array}{rcl} \frac{\partial \xi (k)} {\partial {{\bf {w}}}^{{_\ast}}(k + 1)}& =& \alpha [{\bf {w}}(k + 1) -{\bf {w}}(k)] - {\alpha }_{p}\bf{x}(k)\left [{d}^{{_\ast}}(k) -{\bf{x}}^{H}(k){\bf {w}}(k + 1)\right ] \\ \end{array}$$

By setting this result to zero it follows that

$$\begin{array}{rcl} \left [\alpha \bf{I} + {\alpha }_{p}\bf{x}(k){\bf{x}}^{H}(k)\right ]{\bf {w}}(k + 1)& =& \alpha {\bf {w}}(k) + {\alpha }_{ p}\bf{x}(k){d}^{{_\ast}}(k) - {\alpha }_{ p}\bf{x}(k){\bf{x}}^{H}(k){\bf {w}}(k) \\ & & +{\alpha }_{p}\bf{x}(k){\bf{x}}^{H}(k){\bf {w}}(k) \\ & =& \left [\alpha \bf{I} + {\alpha }_{p}\bf{x}(k){\bf{x}}^{H}(k)\right ]{\bf {w}}(k) + {\alpha }_{ p}\bf{x}(k){e}^{{_\ast}}(k) \\ \end{array}$$

This equation can be rewritten as

$$\begin{array}{rcl} {\bf {w}}(k + 1) = {\bf {w}}(k) + {\alpha }_{p}{\left [\alpha \bf{I} + {\alpha }_{p}\bf{x}(k){\bf{x}}^{H}(k)\right ]}^{-1}\bf{x}(k){e}^{{_\ast}}(k)& &\end{array}$$
(4.163)

After applying the matrix inversion lemma, as in (13.28), to compute the inverse in the above equation we get

$$\begin{array}{rcl}{ \left [\alpha \bf{I} + {\alpha }_{p}\bf{x}(k){\bf{x}}^{H}(k)\right ]}^{-1}& =& \frac{\bf{I}} {\alpha } - \frac{\bf{I}} {\alpha }\bf{x}(k){\left [\frac{{\bf{x}}^{H}(k)\bf{x}(k)} {\alpha } + \frac{1} {{\alpha }_{p}}\right ]}^{-1}{\bf{x}}^{H}(k) \frac{\bf{I}} {\alpha } \\ & =& \frac{1} {\alpha }\left [\bf{I} - \frac{\bf{x}(k){\bf{x}}^{H}(k)} {{\bf{x}}^{H}(k)\bf{x}(k) + \frac{\alpha } {{\alpha }_{p}}}\right ] \\ \end{array}$$

Since the above equation will be multiplied on the right-hand side by x(k), it then follows that

$$\begin{array}{rcl} \frac{1} {\alpha }\left [\bf{I} - \frac{\bf{x}(k){\bf{x}}^{H}(k)} {{\bf{x}}^{H}(k)\bf{x}(k) + \frac{\alpha } {{\alpha }_{p}}}\right ]\bf{x}(k)& =& \frac{1} {\alpha }\left [ \frac{\alpha } {{\alpha }_{p}} \frac{\bf{x}(k)} {{\bf{x}}^{H}(k)\bf{x}(k) + \frac{\alpha } {{\alpha }_{p}}}\right ] \\ & =& \frac{\bf{x}(k)} {{\alpha }_{p}{\bf{x}}^{H}(k)\bf{x}(k) + \alpha } \\ \end{array}$$

By employing the relation \(\alpha = \left ( \frac{1} {{\mu }_{n}} - 1 + {\alpha }_{p}\gamma \right )\) in the expression above it follows that

$$\begin{array}{rcl} \frac{\bf{x}(k)} {{\alpha }_{p}{\bf{x}}^{H}(k)\bf{x}(k) + \alpha } = {\mu }_{n}\bf{x}(k)& & \\ \end{array}$$

By replacing the above result in (4.163), it is possible to show that

$$\begin{array}{rcl} {\bf {w}}(k + 1)& =& {\bf {w}}(k) + {\mu }_{n}{\alpha }_{p}\bf{x}(k){e}^{{_\ast}}(k) \\ & =& {\bf {w}}(k) + {\mu }_{n}{\left (\gamma +{ \bf{x}}^{H}(k)\bf{x}(k)\right )}^{-1}\bf{x}(k){e}^{{_\ast}}(k) \\ \end{array}$$

 □ 
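A quick numerical sanity check of this equivalence can be sketched in Python: for arbitrary complex data, the coefficient vector obtained by solving the linear system that results from setting the gradient of (4.161) to zero coincides with the update (4.159). The filter length, step size, and regularization factor below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                        # filter length (assumed)
mu_n, gamma = 0.4, 1e-6      # convergence and regularization factors (assumed)

x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
d = rng.standard_normal() + 1j * rng.standard_normal()
w = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# A priori error e(k) = d(k) - w^H(k) x(k), so that e*(k) = d*(k) - x^H(k) w(k)
e = d - np.vdot(w, x)

# Update (4.159)
w_nlms = w + mu_n / (gamma + np.vdot(x, x).real) * x * np.conj(e)

# Minimizer of (4.161): [alpha I + alpha_p x x^H] w(k+1) = alpha w(k) + alpha_p x d*(k)
alpha_p = 1.0 / (gamma + np.vdot(x, x).real)
alpha = 1.0 / mu_n - 1.0 + alpha_p * gamma
A = alpha * np.eye(N) + alpha_p * np.outer(x, np.conj(x))
b = alpha * w + alpha_p * x * np.conj(d)
w_min = np.linalg.solve(A, b)

print(np.allclose(w_nlms, w_min))   # True: both give the same w(k+1)
```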

Example 4.6 (Transform-Domain LMS algorithm). 

A transform-domain LMS algorithm is used in an application requiring two coefficients and employing the DCT.

  1. (a)

    Show in detail the update equation related to each adaptive filter coefficient as a function of the input signal, given γ and \({\sigma }_{x}^{2}\), where the former is the regularization factor and the latter is the variance of the input signal x(k).

  2. (b)

    Which value of μ would generate an a posteriori error equal to zero?

Solution.

  1. (a)

    The transform matrix in this case is given by

    $$\begin{array}{rcl} \bf{T} = \left [\begin{array}{cc} \frac{\sqrt{2}} {2} & \frac{\sqrt{2}} {2} \\ \frac{\sqrt{2}} {2} & -\frac{\sqrt{2}} {2}\\ \end{array} \right ]& & \\ \end{array}$$

    The update equation of the first coefficient is

    $$\begin{array}{rcl} \hat{{w}}_{0}(k + 1)& =& \hat{{w}}_{0}(k) + \frac{2\mu } {\gamma + {\sigma }_{0}^{2}(k)}e(k){s}_{0}(k) \\ & =& \hat{{w}}_{0}(k) + \frac{2\mu } {\sqrt{2}(\gamma + {\sigma }_{0}^{2}(k))}e(k)({x}_{0}(k) + {x}_{1}(k)) \\ \end{array}$$

    and of the second coefficient is

    $$\begin{array}{rcl} \hat{{w}}_{1}(k + 1)& =& \hat{{w}}_{1}(k) + \frac{2\mu } {\gamma + {\sigma }_{1}^{2}(k)}e(k){s}_{1}(k) \\ & =& \hat{{w}}_{1}(k) + \frac{2\mu } {\sqrt{2}(\gamma + {\sigma }_{1}^{2}(k))}e(k)({x}_{0}(k) - {x}_{1}(k)) \\ \end{array}$$

    where \({\sigma }_{0}^{2}(k) = {\sigma }_{1}^{2}(k) = \frac{1} {2}{\sigma }_{{x}_{0}}^{2}(k) + \frac{1} {2}{\sigma }_{{x}_{1}}^{2}(k)\). These variances are estimated by \({\sigma }_{{x}_{i}}^{2}(k) = \alpha {x}_{i}^{2}(k) + (1 - \alpha ){\sigma }_{{x}_{i}}^{2}(k - 1)\), for i = 0, 1, where α is a small factor chosen in the range 0 < α ≤ 0.1 and γ is the regularization factor.

  2. (b)

    In matrix form the above updating equation can be rewritten as

    $$\begin{array}{rcl} \hat{{\bf {w}}}(k + 1) = \hat{{\bf {w}}}(k) + 2\mu e(k){{\Sigma }}^{-2}(k)\bf{s}(k)& & \end{array}$$
    (4.164)

    where Σ  − 2(k) is a diagonal matrix containing as elements the inverse of the power estimates of the elements of s(k) added to the regularization factor γ. By replacing the above expression in the a posteriori error definition, it follows that

    $$\begin{array}{rcl} \epsilon (k)& =& d(k) -{\bf{s}}^{T}(k)\hat{{\bf {w}}}(k + 1) \\ & =& d(k) -{\bf{s}}^{T}(k)\hat{{\bf {w}}}(k) - 2\mu e(k){\bf{s}}^{T}(k){{\Sigma }}^{-2}(k)\bf{s}(k) = 0 \\ \end{array}$$

    leading to

    $$\begin{array}{rcl} \mu = \frac{1} {2{\bf{s}}^{T}(k){{\Sigma }}^{-2}(k)\bf{s}(k)}& & \\ \end{array}$$

     □ 
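A minimal Python sketch of this two-coefficient example is given below; it performs one update of part (a) and evaluates the μ of part (b) for a single data pair. The data values, the step size μ = 0.1, and the initial power estimates are assumptions made only for illustration, with x0(k) = x(k) and x1(k) = x(k − 1).

```python
import numpy as np

T = np.array([[np.sqrt(2) / 2,  np.sqrt(2) / 2],
              [np.sqrt(2) / 2, -np.sqrt(2) / 2]])    # 2x2 DCT matrix of part (a)

mu, gamma, alpha = 0.1, 1e-6, 0.05     # illustrative parameter values (assumed)
w_hat = np.zeros(2)
sigma2 = np.ones(2)                    # initial power estimates (assumed)

rng = np.random.default_rng(0)
x_vec = rng.standard_normal(2)         # [x0(k), x1(k)] = [x(k), x(k-1)] (assumed data)
d = rng.standard_normal()

s = T @ x_vec                                       # transformed input s(k) = T x(k)
sigma2 = alpha * s**2 + (1 - alpha) * sigma2        # running power estimates
e = d - s @ w_hat                                   # a priori error e(k)
w_hat = w_hat + 2 * mu * e * s / (gamma + sigma2)   # part (a): per-coefficient updates

# Part (b): mu that zeroes the a posteriori error for this data pair
mu_zero = 1.0 / (2.0 * np.sum(s**2 / (gamma + sigma2)))
print(w_hat, mu_zero)
```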

4.7.2 System Identification Simulations

In this subsection, a standard system identification problem is described and solved by using some of the algorithms presented in this chapter.

Example 4.7 (Transform-Domain LMS Algorithm). 

Use the transform-domain LMS algorithm to identify the system described in the example of Sect. 3.6.2. The transform is the DCT.

Solution.

All the results presented here for the transform-domain LMS algorithm are obtained by averaging the results of 200 independent runs.

We run the algorithm with a value of μ = 0.01, with α = 0.05 and \(\gamma = 1{0}^{-6}\). With this value of μ, the misadjustment of the transform-domain LMS algorithm is about the same as that of the LMS algorithm with μ = 0.02. In Fig. 4.13, the learning curves for the eigenvalue spreads 20 and 80 are illustrated. First note that the convergence speed is about the same for different eigenvalue spreads, showing the effectiveness of the rotation performed by the transform in this case. If we compare these curves with those of Fig. 3.9 for the LMS algorithm, we conclude that the transform-domain LMS algorithm has better performance than the LMS algorithm for high eigenvalue spread. For an eigenvalue spread equal to 20, the transform-domain LMS algorithm requires around 200 iterations to achieve convergence, whereas the LMS requires at least 500 iterations. This improvement is achieved without increasing the misadjustment as can be verified by comparing the results of Tables 3.1 and 4.1.

The reader should bear in mind that the improvements in convergence of the transform-domain LMS algorithm can be achieved only if the transformation is effective. In this example, since the input signal is colored using a first-order all-pole filter, the cosine transform is known to be effective because it approximates the KLT.

The finite-precision implementation of the transform-domain LMS algorithm presents similar performance to that of the LMS algorithm, as can be verified by comparing the results of Tables 3.1 and 4.2. An eigenvalue spread of one is used in this example. The value of μ is 0.01, while the remaining parameter values are \(\gamma = {2}^{-{b}_{d}}\) and α = 0.05. The value of μ in this case is chosen the same as for the LMS algorithm. □ 
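A rough Python sketch of this kind of simulation is given below. Since the system of Sect. 3.6.2 is not reproduced here, a placeholder unknown system, a first-order AR coloring filter with pole 0.9, and a measurement-noise level are assumptions made purely for illustration; only the structure of the transform-domain LMS recursion follows the equations of this chapter.

```python
import numpy as np
from scipy.fft import dct              # orthonormal DCT used as the fixed transform

def tdlms_dct(x, d, N, mu=0.01, gamma=1e-6, alpha=0.05):
    """Transform-domain LMS with a DCT (sketch); returns the squared-error curve."""
    w = np.zeros(N + 1)
    sigma2 = np.full(N + 1, 1e-3)      # power estimates of the transformed signal
    x_buf = np.zeros(N + 1)
    err2 = np.zeros(len(x))
    for k in range(len(x)):
        x_buf = np.r_[x[k], x_buf[:-1]]          # tapped-delay line
        s = dct(x_buf, norm="ortho")             # s(k) = T x(k)
        sigma2 = alpha * s**2 + (1 - alpha) * sigma2
        e = d[k] - s @ w
        w += 2 * mu * e * s / (gamma + sigma2)
        err2[k] = e**2
    return err2

# Placeholder setup: the actual system of Sect. 3.6.2 is not reproduced here
rng = np.random.default_rng(0)
N, runs, iters = 9, 200, 500
wo = rng.standard_normal(N + 1)                  # hypothetical unknown system
mse = np.zeros(iters)
for _ in range(runs):
    v = rng.standard_normal(iters)
    x = np.zeros(iters)
    x[0] = v[0]
    for k in range(1, iters):                    # first-order AR coloring (pole 0.9, assumed)
        x[k] = 0.9 * x[k - 1] + v[k]
    d = np.convolve(x, wo)[:iters] + 0.03 * rng.standard_normal(iters)
    mse += tdlms_dct(x, d, N) / runs
```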

Fig. 4.13

Learning curves for the transform-domain LMS algorithm for eigenvalue spreads: 20 and 80

Table 4.1 Evaluation of the Transform-Domain LMS Algorithm
Table 4.2 Results of the Finite-Precision Implementation of the Transform-Domain LMS Algorithm

Example 4.8 (Affine Projection Algorithm). 

Use the affine projection algorithm with L = 0, L = 1, and L = 4 to identify the system described in the example of Sect. 3.6.2. Do not consider the finite-precision case.

Solution.

Figure 4.14 depicts the estimate of the MSE learning curve of the affine projection algorithm for the case of eigenvalue spread equal to 1, obtained by averaging the results of 200 independent runs. As can be noticed, by increasing L the algorithm becomes faster. The chosen convergence factor is μ = 0.4, and the measured misadjustments are M = 0.32 for L = 0, M = 0.67 for L = 1, and M = 2.05 for L = 4. In all cases γ = 0 is utilized, and for L = 1 in the first iteration we start with L = 0, whereas for L = 4 in the first four iterations we employ L = 0, 1, 2, and 3, respectively. If we consider that \(E\left [ \frac{1} {\|\bf{x}{(k)\|}^{2}} \right ] \approx \frac{1} {(N+1){\sigma }_{x}^{2}}\), the expected misadjustment according to (4.126) is M = 0.25, which is reasonably close to the measured ones, considering this approximation as well as the approximations in the derivation of the theoretical formula.
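For reference, a compact Python sketch of the affine projection recursion with L data reuses is shown below; the unknown system, signal lengths, and noise level in the usage lines are assumptions for illustration, and γ = 0 presumes the data matrix has full rank, as in this example.

```python
import numpy as np

def affine_projection(x, d, N, L, mu=0.4, gamma=0.0):
    """Affine projection algorithm with L data reuses (sketch)."""
    w = np.zeros(N + 1)
    err = np.zeros(len(x))
    for k in range(N + L, len(x)):
        # Xap(k) has columns x(k), x(k-1), ..., x(k-L)
        Xap = np.column_stack([x[k - l - N: k - l + 1][::-1] for l in range(L + 1)])
        dap = d[k - L: k + 1][::-1]
        eap = dap - Xap.T @ w
        w += mu * Xap @ np.linalg.solve(Xap.T @ Xap + gamma * np.eye(L + 1), eap)
        err[k] = eap[0]                 # a priori error at iteration k
    return err, w

# Illustrative usage with a hypothetical 9th-order unknown system
rng = np.random.default_rng(0)
wo = rng.standard_normal(10)
x = rng.standard_normal(2000)
d = np.convolve(x, wo)[:2000] + 0.03 * rng.standard_normal(2000)
e0, _ = affine_projection(x, d, N=9, L=0)   # L = 0: normalized LMS behavior
e4, _ = affine_projection(x, d, N=9, L=4)   # more reuses: faster, higher misadjustment
```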

Fig. 4.14

Learning curves for the affine projection algorithms for L = 0, L = 1, and L = 4, eigenvalue spread equal to 1

Figure 4.15 depicts the average of the squared error obtained from 200 independent runs for the case of eigenvalue spread equal to 80. Again we verify that by increasing L the algorithm becomes faster. The chosen convergence factor is also μ = 0.4, and the measured misadjustments for three values of the eigenvalue spread are listed in Table 4.3. It can be observed that higher eigenvalue spreads do not increase the misadjustment substantially.

Table 4.3 Evaluation of the Affine Projection Algorithm, μ = 0.4
Fig. 4.15

Learning curves for the affine projection algorithms for L = 0, L = 1, and L = 4, eigenvalue spread equal to 80

Figure 4.16 shows the effect of using different values for the convergence factor when L = 1 and the eigenvalue spread is equal to 1. For μ = 0.2 the misadjustment is M = 0.30, for μ = 0.4 the misadjustment is M = 0.67, and for μ = 1 the misadjustment is M = 1.56. □ 

4.7.3 Signal Enhancement Simulations

In this subsection, a signal enhancement simulation environment is described. This example will also be employed in some of the following chapters.

Fig. 4.16

Learning curves for the affine projection algorithms for μ = 0. 2, μ = 0. 4, and μ = 1

In a signal enhancement problem, the reference signal is

$$r(k) =\sin (0.2\pi k) + {n}_{r}(k)$$

where \({n}_{r}(k)\) is zero-mean Gaussian white noise with variance \({\sigma }_{{n}_{r}}^{2} = 10\). The input signal is given by \({n}_{r}(k)\) passed through a filter with the following transfer function

$$H(z) = \frac{0.4} {{z}^{2} - 1.36z + 0.79}$$

The adaptive filter is a 20th-order FIR filter. In all examples, a delay L = 10 is applied to the reference signal.
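A minimal Python sketch of this simulation environment, assuming 2,000 samples and using SciPy's lfilter to implement H(z), could be written as follows; the use of the delayed reference as the desired signal of the adaptive filter follows the setup described above.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
K = 2000
k = np.arange(K)

n_r = np.sqrt(10) * rng.standard_normal(K)              # noise with variance 10
r = np.sin(0.2 * np.pi * k) + n_r                       # reference signal r(k)

# Input signal: n_r(k) filtered by H(z) = 0.4 / (z^2 - 1.36 z + 0.79)
x = lfilter([0.0, 0.0, 0.4], [1.0, -1.36, 0.79], n_r)

L_delay = 10
d = np.r_[np.zeros(L_delay), r[:-L_delay]]              # delayed reference as desired signal
N = 20                                                  # 20th-order FIR adaptive filter
```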

Example 4.9 (Quantized-Error and Normalized LMS Algorithms). 

Using the sign-error, power-of-two error with \({b}_{d} = 12\), and normalized LMS algorithms:

  1. (a)

    Choose an appropriate μ in each case and run an ensemble of 50 experiments. Plot the average learning curve.

  2. (b)

    Plot the output errors and comment on the results.

Solution.

The maximum value of μ for the LMS algorithm in this example is 0.005. The value of μ for both the sign-error and power-of-two LMS algorithms is chosen as 0.001. The coefficients of the adaptive filter are initialized with zero. For the normalized LMS algorithm, \({\mu }_{n} = 0.4\) and \(\gamma = 1{0}^{-6}\) are used. Figure 4.17 depicts the learning curves for the three algorithms. The results show that the sign-error and power-of-two error algorithms present similar convergence speed, whereas the normalized LMS algorithm converges faster. The reader should notice that the MSE after convergence is not small since we are dealing with an example where the signal-to-noise ratio is low.
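For reference, minimal sketches of the three coefficient updates used in this example are given below. The step sizes are those quoted above, and the power-of-two quantizer follows one common definition parameterized by bd bits and a lower bound τ; both the quantizer details and the helper names are assumptions of this sketch.

```python
import numpy as np

def sign_error_update(w, x, e, mu=0.001):
    # w(k+1) = w(k) + 2*mu*sgn[e(k)]*x(k)
    return w + 2 * mu * np.sign(e) * x

def power_of_two_update(w, x, e, mu=0.001, bd=12, tau=2.0**-12):
    # Quantize e(k) to a power of two (one common definition, assumed here)
    mag = abs(e)
    if mag >= 1.0:
        q = np.sign(e)
    elif mag < 2.0 ** (-bd + 1):
        q = tau * np.sign(e)
    else:
        q = np.sign(e) * 2.0 ** np.floor(np.log2(mag))
    return w + 2 * mu * q * x

def normalized_lms_update(w, x, e, mu_n=0.4, gamma=1e-6):
    # w(k+1) = w(k) + mu_n/(gamma + x^T x) e(k) x(k)
    return w + mu_n / (gamma + x @ x) * e * x

# Inside the adaptation loop of the enhancement setup, with x_buf the regressor:
#   e = d[k] - w @ x_buf
#   w = normalized_lms_update(w, x_buf, e)
```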

Fig. 4.17

Learning curves for the (a) Sign-error, (b) Power-of-two, and (c) Normalized LMS algorithms

The 128-point DFT of the input signal is shown in Fig. 4.18, where the presence of the sinusoid cannot be noticed. The same figure shows the DFT of the error and the error signal itself, for the experiment using the normalized LMS algorithm. For the DFTs, the magnitude of the outputs is presented. As can be verified, after convergence the output error tends to produce a signal with the same period as the sinusoid, and its DFT clearly shows the presence of the sinusoid. The other two algorithms lead to similar results. □ 

Fig. 4.18

(a) DFT of the input signal, (b) DFT of the error signal, (c) The output error for the normalized LMS algorithm

Fig. 4.19

Learning curves for the (a) Sign-error, (b) Power-of-two, and (c) Normalized LMS algorithms

4.7.4 Signal Prediction Simulations

In this subsection a signal prediction simulation environment is described. This example will also be used in some of the following chapters.

In a prediction problem the input signal is

$$x(k) = -\sqrt{2}\ \sin (0.2\pi k) + \sqrt{2}\ \sin (0.05\pi k) + {n}_{x}(k)$$

where \({n}_{x}(k)\) is zero-mean Gaussian white noise with variance \({\sigma }_{{n}_{x}}^{2} = 1\). The adaptive filter is a fourth-order FIR filter.

  1. (a)

    Run an ensemble of 50 experiments and plot the average learning curve.

  2. (b)

    Determine the zeros of the resulting FIR filter and comment on the results.

Example 4.10 (Quantized-Error and Normalized LMS Algorithms). 

We solve the above problem using the sign-error, power-of-two error with \({b}_{d} = 12\), and normalized LMS algorithms.

Solution.

In the first step, each algorithm is tested in order to determine experimentally the maximum value of μ for which convergence is achieved. The convergence factor is then chosen as μmax ∕ 5 for each algorithm. The resulting values of μ for the sign-error and power-of-two LMS algorithms are 0.0028 and 0.0044, respectively. For the normalized LMS algorithm, \({\mu }_{n} = 0.4\) and \(\gamma = 1{0}^{-6}\) are used. The coefficients of the adaptive filter are initialized with zero. The learning curves for the three algorithms are depicted in Fig. 4.19. In all cases, we notice a strong attenuation of the predictor response around the frequencies of the two sinusoids. See, for example, the response depicted in Fig. 4.20 obtained by running the power-of-two LMS algorithm. The zeros of the transfer function from the input to the output error are calculated for the power-of-two algorithm:

$$-0.3939;\:-0.2351 \pm \mathrm{J}0.3876;\:-0.6766 \pm \mathrm{J}0.3422$$

Notice that the predictor tends to place its zeros at low frequencies, in order to attenuate the two low-frequency sinusoids.

Fig. 4.20

Magnitude response of the FIR adaptive filter at a given iteration after convergence using the power-of-two LMS algorithm

In the experiments, we notice that for a given additional noise, a smaller convergence factor leads to higher attenuation at the sinusoid frequencies. This is an expected result since the excess MSE is smaller. Another observation is that the attenuation also grows as the signal-to-noise ratio is reduced, again due to the smaller MSE. □ 
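A compact Python sketch of this prediction experiment, assuming a one-step-ahead predictor adapted with the normalized LMS algorithm, is shown below; the exact zero locations depend on the particular run and will differ from the values quoted for the power-of-two algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 2000, 4
k = np.arange(K)
x = (-np.sqrt(2) * np.sin(0.2 * np.pi * k)
     + np.sqrt(2) * np.sin(0.05 * np.pi * k)
     + rng.standard_normal(K))                 # unit-variance white noise

w = np.zeros(N + 1)
mu_n, gamma = 0.4, 1e-6                        # normalized LMS parameters (assumed)
for i in range(N + 1, K):
    xk = x[i - N - 1: i][::-1]                 # [x(k-1), ..., x(k-N-1)]
    e = x[i] - w @ xk                          # one-step-ahead prediction error
    w += mu_n / (gamma + xk @ xk) * e * xk

# Zeros of the error transfer function E(z)/X(z) = 1 - z^{-1} W(z)
zeros = np.roots(np.r_[1.0, -w])
print(np.sort_complex(zeros))
```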

4.8 Concluding Remarks

In this chapter, a number of adaptive-filtering algorithms derived from the LMS algorithm were presented. Two basic directions were followed in the derivation of the algorithms: one was to search for computationally simpler algorithms, and the other was to refine the LMS algorithm in search of performance improvements. The simplified algorithms lead to low-power, low-complexity, and/or high-speed integrated circuit implementations [31], at the cost of increased misadjustment and/or reduced convergence speed, among other drawbacks [32]. The simplified algorithms discussed here were the quantized-error algorithms.

We also introduced the LMS-Newton algorithm, whose performance is independent of the eigenvalue spread of the input signal correlation matrix. This algorithm is related to the RLS algorithm which will be discussed in the following chapter, although some distinctive features exist between them [41]. Newton-type algorithms with reduced computational complexity are also known [42, 43], and the main characteristic of this class of algorithms is to reduce the computation involving the inverse of the estimate of R.

In the normalized LMS algorithm, the straightforward objective was to find the step size that minimizes the instantaneous output error. There are many papers dealing with the analysis [33]-[35] and applications [36] of the normalized LMS algorithm. The idea of using a variable step size in the LMS and normalized LMS algorithms leads to a number of interesting algorithms [37]-[39], which in some cases are very efficient in tracking nonstationary environments [40].

The transform-domain LMS algorithm aimed at reducing the eigenvalue spread of the input signal correlation matrix. Several frequency-domain adaptive algorithms, which are related in some sense to the transform-domain LMS algorithm, have been investigated in recent years [44]. Such algorithms exploit the whitening property associated with the normalized transform-domain LMS algorithm, and most of them update the coefficients at a rate lower than the input sampling rate. One of the resulting structures, presented in [45], can be interpreted as a direct generalization of the transform-domain LMS algorithm and is called the generalized adaptive subband decomposition structure. It consists of a small-size fixed transform, applied to the input sequence, followed by sparse adaptive subfilters updated at the input rate. In high-order adaptive-filtering problems, the use of this structure with appropriately chosen transform size and sparsity factor can lead to significant convergence-rate improvement for colored input signals when compared to the standard LMS algorithm. This improvement is achieved without the need for large transform sizes. Other algorithms to deal with high-order adaptive filters are discussed in Chap. 12.

The affine projection algorithm is very appealing in applications requiring a trade-off between convergence speed and computational complexity. Although the algorithms in the affine projection family might have high misadjustment, their combination with deterministic objective functions leading to data-selective updating results in computationally efficient algorithms with low misadjustment and high convergence speed [25], as will be discussed in Chap. 6.

Several simulation examples involving the LMS-based algorithms were presented in this chapter. These examples help the reader understand the main practical characteristics of the LMS-based algorithms.

4.9 Problems

  1. 1.

    From (4.16) derive the difference equation for v (k) given by (4.19).

  2. 2.

    Prove the validity of (4.27).

  3. 3.

    The sign-error algorithm is used to predict the signal \(x(k) =\sin (\pi k/3)\) using a second-order FIR filter with the first tap fixed at 1, by minimizing the mean square value of y(k). This is an alternative way to interpret how the predictor works. Calculate an appropriate μ, the output signal y(k), and the filter coefficients for the first 10 iterations. Start with w T(0) = [1 0 0].

  4. 4.

    Derive an LMS-Newton algorithm leading to zero a posteriori error.

  5. 5.

    Derive the updating equations of the affine projection algorithm, for L = 1.

  6. 6.

    Use the sign-error algorithm to identify a system with the transfer function given below. The input signal is a uniformly distributed white noise with variance σ x 2 = 1, and the measurement noise is Gaussian white noise uncorrelated with the input with variance \({\sigma }_{n}^{2} = 1{0}^{-3}\). The adaptive filter has 12 coefficients.

    $$H(z) = \frac{1 - {z}^{-12}} {1 + {z}^{-1}}$$
    1. (a)

      Calculate the upper bound for μ (μmax) to guarantee the algorithm stability.

    2. (b)

      Run the algorithm for μmax ∕ 2, μmax ∕ 5, and μmax ∕ 10. Comment on the convergence behavior in each case.

    3. (c)

      Measure the misadjustment in each example and compare with the results obtained by (4.28).

    4. (d)

      Plot the obtained FIR filter frequency response at any iteration after convergence is achieved and compare with the unknown system.

  7. 7.

    Repeat the previous problem using an adaptive filter with 8 coefficients and interpret the results.

  8. 8.

    Repeat problem 6 when the input signal is a uniformly distributed white noise with variance \({\sigma }_{{n}_{x}}^{2} = 0.5\), filtered by an all-pole filter given by

    $$H(z) = \frac{z} {z - 0.9}$$
  9. 9.

    In problem 6, consider that the additional noise has the following variances (a) σ n 2 = 0, (b) σ n 2 = 1. Comment on the results obtained in each case.

  10. 10.

    Perform the equalization of a channel with the following impulse response

    $$h(k) = ku(k) - (2k - 9)u(k - 5) + (k - 9)u(k - 10)$$

    using a known training signal consisting of a binary (-1,1) random signal. An additional Gaussian white noise with variance 10 − 2 is present at the channel output.

    1. (a)

      Apply the sign-error with an appropriate μ and find the impulse response of an equalizer with 15 coefficients.

    2. (b)

      Convolve the equalizer impulse response at an iteration after convergence, with the channel impulse response and comment on the result.

  11. 11.

    In a system identification problem, the input signal is generated by an autoregressive process given by

    $$x(k) = -1.2x(k - 1) - 0.81x(k - 2) + {n}_{x}(k)$$

    where n x (k) is zero-mean Gaussian white noise with variance such that σ x 2 = 1. The unknown system is described by

    $$H(z) = 1 + 0.9{z}^{-1} + 0.1{z}^{-2} + 0.2{z}^{-3}$$

    The adaptive filter is also a third-order FIR filter. Using the sign-error algorithm:

    1. (a)

      Choose an appropriate μ, run an ensemble of 20 experiments, and plot the average learning curve.

    2. (b)

      Measure the excess MSE and compare the results with the theoretical value.

  12. 12.

    In the previous problem, calculate the time constant τ wi and the expected number of iterations to achieve convergence.

  13. 13.

    The sign-error algorithm is applied to identify a 7th-order time-varying unknown system whose coefficients are first-order Markov processes with \({\lambda }_{{\bf {w}}} = 0.999\) and \({\sigma }_{{\bf {w}}}^{2} = 0.001\). The initial time-varying system multiplier coefficients are

    $${{\bf {w}}}_{o}^{T} = [0.03490\:\:\: - 0.011\:\:\: - 0.06864\:\:\:0.22391\:\:\:0.55686\:\:\:0.35798\:\:\: - 0.0239\:\:\: - 0.07594]$$

    The input signal is Gaussian white noise with variance \({\sigma }_{x}^{2} = 0.7\), and the measurement noise is also Gaussian white noise independent of the input signal and of the elements of \({\bf{n}}_{{\bf {w}}}(k)\), with variance \({\sigma }_{n}^{2} = 0.01\). For μ = 0.01, simulate the experiment described and measure the excess MSE.

  14. 14.

    Reduce the value of λ w to 0.95 in problem 13, simulate, and comment on the results.

  15. 15.

    Suppose a 15th-order FIR digital filter with multiplier coefficients given below, is identified through an adaptive FIR filter of the same order using the sign-error algorithm. Use fixed-point arithmetic and run simulations for the following case.

    $$\begin{array}{@{}l@{\qquad }l} \mbox{ Additional noise: white noise with variance} \qquad &{\sigma }_{n}^{2} = 0.0015 \\ \mbox{ Coefficient wordlength:} \qquad &{b}_{c} = 16\mbox{ bits} \\ \mbox{ Signal wordlength:} \qquad &{b}_{d} = 16\mbox{ bits} \\ \mbox{ Input signal: Gaussian white noise with variance}\qquad &{\sigma }_{x}^{2} = 0.7 \\ \qquad &\mu = 0.01\\ \qquad \end{array}$$
    $$\begin{array}{rcl}{ {\bf {w}}}_{o}^{T}& = [0.0219360\ 0.0015786\ - 0.0602449\ - 0.0118907\ 0.1375379 & \\ & \qquad 0.0574545\ - 0.3216703\ - 0.5287203\ - 0.2957797\ 0.0002043 & \\ & \qquad 0.290670\ - 0.0353349\ - 0.068210\ 0.0026067\ 0.0010333\ - 0.0143593]& \\ \end{array}$$

    Plot the learning curves of the estimates of \(E[\|\Delta {\bf {w}}{(k)}_{Q}{\|}^{2}]\) and \(\xi {(k)}_{Q}\) obtained through 25 independent runs, for the finite- and infinite-precision implementations.

  16. 16.

    Repeat the above problem for the following cases

    1. (a)

      \({\sigma }_{n}^{2} = 0.01\), \({b}_{c} = 12\) bits, \({b}_{d} = 12\) bits, \({\sigma }_{x}^{2} = 0.7\), \(\mu = 1{0}^{-4}\).

    2. (b)

      \({\sigma }_{n}^{2} = 0.1\), \({b}_{c} = 10\) bits, \({b}_{d} = 10\) bits, \({\sigma }_{x}^{2} = 0.8\), \(\mu = 2.0 \times 1{0}^{-5}\).

    3. (c)

      \({\sigma }_{n}^{2} = 0.05\), \({b}_{c} = 14\) bits, \({b}_{d} = 16\) bits, \({\sigma }_{x}^{2} = 0.8\), \(\mu = 3.5 \times 1{0}^{-4}\).

  17. 17.

    Repeat problem 15 for the case where the input signal is a first-order Markov process with \({\lambda }_{x} = 0.95\).

  18. 18.

    Repeat problem 6 for the dual-sign algorithm given ε = 16 and ρ = 1, and comment on the results.

  19. 19.

    Repeat problem 6 for the power-of-two error algorithm given b d  = 6 and \(\tau = {2}^{-{b}_{d}}\), and comment on the results.

  20. 20.

    Repeat problem 6 for the sign-data and sign-sign algorithms and compare the results.

  21. 21.

    Show the validity of the matrix inversion lemma defined in (4.51).

  22. 22.

    For the setup described in problem 8, choose an appropriate μ and run the LMS-Newton algorithm.

    1. (a)

      Measure the misadjustment.

    2. (b)

      Plot the frequency response of the FIR filter obtained after convergence is achieved and compare with the unknown system.

  23. 23.

    Repeat problem 8 using the normalized LMS algorithm.

  24. 24.

    Repeat problem 8 using the transform-domain LMS algorithm with DFT. Compare the results with those obtained with the standard LMS algorithm.

  25. 25.

    Repeat problem 8 using the affine projection algorithm.

  26. 26.

    Repeat problem 8 using the transform-domain LMS algorithm with DCT.

  27. 27.

    For the input signal described in problem 8, derive the autocorrelation matrix of order one (2 ×2). Apply the DCT and the normalization to R in order to generate \(\hat{\bf{R}} ={ {\Sigma }}^{-2}\bf{T}\bf{R}{\bf{T}}^{T}\). Compare the eigenvalue spreads of R and \(\hat{\bf{R}}\).

  28. 28.

    Repeat the previous problem for R with dimension 3 by 3.

  29. 29.

    Use the complex affine projection algorithm with L = 3 to equalize a channel with the transfer function given below. The input signal is a four QAM signal representing a randomly generated bit stream with the signal-to-noise ratio \(\frac{{\sigma }_{\tilde{x}}^{2}} {{\sigma }_{n}^{2}} = 20\) at the receiver end, that is, \(\tilde{x}(k)\) is the received signal without taking into consideration the additional channel noise. The adaptive filter has ten coefficients.

    $$H(z) = (0.34 - 0.27\mathrm{J}) + (0.87 + 0.43\mathrm{J}){z}^{-1} + (0.34 - 0.21\mathrm{J}){z}^{-2}$$
    1. (a)

      Run the algorithm for μ = 0.1, μ = 0.4, and μ = 0.8. Comment on the convergence behavior in each case.

    2. (b)

      Plot the real versus imaginary parts of the received signal before and after equalization.

    3. (c)

      Increase the number of coefficients to 20 and repeat the experiment in (b).

  30. 30.

    Repeat problem 29 for the case of the normalized LMS algorithm.

  31. 31.

    In a system identification problem the input signal is generated from a four QAM of the form

    $$x(k) = {x}_{\mathrm{re}}(k) + \mathrm{J}{x}_{\mathrm{im}}(k)$$

    where x re(k) and x im(k) assume values ± 1 randomly generated. The unknown system is described by

    $$H(z) = 0.32 + 0.21\mathrm{J} + (-0.3 + 0.7\mathrm{J}){z}^{-1} + (0.5 - 0.8\mathrm{J}){z}^{-2} + (0.2 + 0.5\mathrm{J}){z}^{-3}$$

    The adaptive filter is also a third-order complex FIR filter, and the additional noise is composed of zero-mean Gaussian white noises in the real and imaginary parts with variance \({\sigma }_{n}^{2} = 0.4\). Using the complex affine projection algorithm with L = 1, choose an appropriate μ, run an ensemble of 20 experiments, and plot the average learning curve.

  32. 32.

    Repeat problem 31 utilizing the affine projection algorithm with L = 4.

  33. 33.

    Derive a complex transform-domain LMS algorithm for the case the transformation matrix is the DFT.

  34. 34.

    The Quasi-Newton algorithm first proposed in [51] is described by the following set of equations

    $$\begin{array}{rcl} e(k)& =& d(k) -{{\bf {w}}}^{T}(k)\bf{x}(k) \\ \mu (k)& =& \frac{1} {2{\bf{x}}^{T}(k){\hat{\bf{R}}}^{-1}(k)\bf{x}(k)} \\ {\bf {w}}(k + 1)& =& {\bf {w}}(k) + 2\:\mu (k)\:e(k)\:{\hat{\bf{R}}}^{-1}(k)\bf{x}(k) \\ {\hat{\bf{R}}}^{-1}(k + 1)& =&{ \hat{\bf{R}}}^{-1}(k) - 2\mu (k)\left (1 - \mu (k)\right ){\hat{\bf{R}}}^{-1}(k)\bf{x}(k){\bf{x}}^{T}(k){\hat{\bf{R}}}^{-1}(k) \end{array}$$
    (4.165)
    1. (a)

      Apply this algorithm as well as the binormalized LMS algorithm to identify the system

      $$H(z) = 1 + {z}^{-1} + {z}^{-2}$$

    when the additional noise is a uniformly distributed white noise with variance \({\sigma }_{n}^{2} = 0.01\), and the input signal is a Gaussian noise with unit variance filtered by an all-pole filter given by

    $$G(z) = \frac{0.19z} {z - 0.9}$$

    Through simulations, compare the convergence speed of the two algorithms when their misadjustments are approximately the same. The latter condition can be met by choosing the μ in the binormalized LMS algorithm appropriately.

  35. 35.

    Show the update equation of a stochastic gradient algorithm designed to search the following objective function.

    $$\begin{array}{rcl} F[{\bf {w}}(k)] = a\vert d(k) -{{\bf {w}}}^{H}(k)\bf{x}(k){\vert }^{4} + b\vert d(k) -{{\bf {w}}}^{H}(k)\bf{x}(k){\vert }^{3}& & \\ \end{array}$$
  36. 36.
    1. (a)

      A normalized LMS algorithm with convergence factor equal to one receives the following data

      $$\begin{array}{rcl} \bf{x}(0) = \left [\begin{array}{c} 1\\ 2\end{array} \right ]& & \\ d(0) = 1& & \\ \end{array}$$

      and

      $$\begin{array}{rcl} \bf{x}(1) = \left [\begin{array}{c} 2\\ 1\end{array} \right ]& & \\ d(1) = 0& & \\ \end{array}$$

      with zero initial values for the coefficients. Determine the hyperplanes \(\mathcal{S}(k)\)

      $$\begin{array}{rcl} \mathcal{S}(k) =\{ {\bf {w}}(k + 1) \in {\mathbb{R}}^{2} : d(k) -{{\bf {w}}}^{T}(k + 1)\bf{x}(k) = 0\}& & \\ \end{array}$$

      for the two updates.

    2. (b)

      If these data belong to a system identification problem without additional noise, what would be the optimal coefficients of the unknown system?

  37. 37.

    An adaptive filter is employed to identify an unknown system of order 20 using sufficient order, producing a misadjustment of 30%. Assume the input signal is a white Gaussian noise with unit variance and \({\sigma }_{n}^{2} = 0.01\).

    1. (a)

      For an LMS algorithm what value of μ is required to obtain the desired result?

    2. (b)

      What about the value of μ for the affine projection algorithm with L = 2 and using (4.125)? Is this expression suitable for this case?

  38. 38.

    Given the updating equation

    $$\begin{array}{rcl} {\bf {w}}(k + 1) = {\bf {w}}(k) + \frac{{\mu }_{n}} {\gamma + {\sigma }_{x}^{2}(k)}\:e(k)\:\bf{x}(k)& & \\ \end{array}$$

    where \({\sigma }_{x}^{2}(k) = \alpha {x}^{2}(k) + (1 - \alpha ){\sigma }_{x}^{2}(k - 1)\), derive the objective function that the algorithm minimizes. Assume that γ ≈ 0 is included only for regularization purposes.

  39. 39.

    Derive an affine projection algorithm for real signals and one reuse (binormalized) employing a forgetting factor λ such that

    $$\begin{array}{rcl}{ \bf{X}}_{\mathrm{ap}}(k)& =& \left [\begin{array}{ccccc} x(k) & \lambda x(k - 1) \\ x(k - 1) & \lambda x(k - 2)\\ \vdots & \vdots \\ x(k - N)&\lambda x(k - N - 1)\end{array} \right ] \\ & =& [\bf{x}(k)\:\lambda \bf{x}(k - 1)] \\ \end{array}$$

    and

    $$\begin{array}{rcl}{ \bf{d}}_{\mathrm{ap}}(k)& =& \left [\begin{array}{c} d(k)\\ \lambda d(k - 1) \end{array} \right ]\\ \end{array}$$

    Describe in detail the objective function being minimized when a convergence factor μ is used.