
2.1 Introduction

This chapter includes a brief review of deterministic and random signal representations. Due to the extent of those subjects, our review is limited to the concepts that are directly relevant to adaptive filtering. The properties of the correlation matrix of the input signal vector are investigated in some detail, since they play a key role in the statistical analysis of adaptive-filtering algorithms.

The Wiener solution that represents the minimum mean-square error (MSE) solution of discrete-time filters realized through a linear combiner is also introduced. This solution depends on the input signal correlation matrix as well as on the cross-correlation between the elements of the input signal vector and the reference signal. The values of these correlations form the parameters of the MSE surface, which is a quadratic function of the adaptive-filter coefficients. The linearly constrained Wiener filter, a technique commonly used in antenna array processing applications, is also presented. The transformation of the constrained minimization problem into an unconstrained one is also discussed. Motivated by the importance of the properties of the MSE surface, we analyze them using some results related to the input signal correlation matrix.

In practice the parameters that determine the MSE surface shape are not available. What is left is to directly or indirectly estimate these parameters using the available data and to develop adaptive algorithms that use these estimates to search the MSE surface, such that the adaptive-filter coefficients converge to the Wiener solution in some sense. The starting point to obtain an estimation procedure is to investigate the convenience of applying the classical searching methods of optimization theory [1–3] to adaptive filtering. The Newton and steepest-descent algorithms are investigated as possible searching methods for adaptive filtering. Although neither method is directly applicable to practical adaptive filtering, insights inspired by them led to practical algorithms such as the least-mean-square (LMS) [4, 5] and Newton-based algorithms. The Newton and steepest-descent algorithms are introduced in this chapter, whereas the LMS algorithm is treated in the next chapter.

Also, in the present chapter, the main applications of adaptive filters are revisited and discussed in greater detail.

2.2 Signal Representation

In this section, we briefly review some concepts related to deterministic and random discrete-time signals. Only specific results essential to the understanding of adaptive filtering are reviewed. For further details on signals and digital signal processing we refer to [6–13].

2.2.1 Deterministic Signals

A deterministic discrete-time signal is characterized by a defined mathematical function of the time index k, with k = 0, ±1, ±2, ±3, …. An example of a deterministic signal (or sequence) is

$$\begin{array}{rcl} x(k) ={ \mathrm{e}}^{-\alpha \:k}\cos (\omega k) + u(k)& &\end{array}$$
(2.1)

where u(k) is the unit step sequence.

The response of a linear time-invariant filter to an input x(k) is given by the convolution summation, as follows [7]:

$$\begin{array}{rlrlrl} y(k) & = x(k) {_\ast} h(k) = \sum\limits_{n=-\infty }^{\infty }x(n)h(k - n) & & \\ & = \sum\limits_{n=-\infty }^{\infty }h(n)x(k - n) = h(k) {_\ast} x(k) &\end{array}$$
(2.2)

where h(k) is the impulse response of the filter.

The \(\mathcal{Z}\)-transform of a given sequence x(k) is defined as

$$\begin{array}{rcl} \mathcal{Z}\{x(k)\} = X(z) = \sum\limits_{k=-\infty }^{\infty }x(k){z}^{-k}& &\end{array}$$
(2.3)

for regions in the \(\mathcal{Z}\)-plane such that this summation converges. If the \(\mathcal{Z}\)-transform is defined for a given region of the \(\mathcal{Z}\)-plane, in other words the above summation converges in that region, the convolution operation can be replaced by a product of the \(\mathcal{Z}\)-transforms as follows [7]:

$$\begin{array}{rcl} Y (z) = H(z)\:X(z)& &\end{array}$$
(2.4)

where Y (z), X(z), and H(z) are the \(\mathcal{Z}\)-transforms of y(k), x(k), and h(k), respectively. Considering only waveforms that start at an instant k ≥ 0 and have finite power, their \(\mathcal{Z}\)-transforms will always be defined outside the unit circle.
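As a quick numerical sanity check of the convolution sum (2.2) and the transform-domain product (2.4), the Python sketch below convolves the deterministic sequence of (2.1) with a hypothetical 3-tap impulse response and verifies that a sufficiently zero-padded DFT product gives the same output. The parameter values and the impulse response are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Deterministic sequence of (2.1): x(k) = exp(-alpha k) cos(omega k) u(k)
alpha, omega = 0.1, 0.5
k = np.arange(64)
x = np.exp(-alpha * k) * np.cos(omega * k)

h = np.array([1.0, 0.5, 0.25])     # hypothetical 3-tap impulse response

# Time domain: the convolution sum (2.2)
y_time = np.convolve(x, h)

# Transform domain: with enough zero padding, the DFT product mirrors
# Y(z) = H(z) X(z) of (2.4) evaluated on the unit circle
n_fft = len(x) + len(h) - 1
y_freq = np.fft.ifft(np.fft.fft(x, n_fft) * np.fft.fft(h, n_fft)).real

print(np.allclose(y_time, y_freq))   # True
```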

For finite-energy waveforms, it is convenient to use the discrete-time Fourier transform defined as

$$\begin{array}{rcl} \mathcal{F}\{x(k)\} = X({\mathrm{e}}^{j\omega }) = \sum\limits_{k=-\infty }^{\infty }x(k){\mathrm{e}}^{-j\omega k}& &\end{array}$$
(2.5)

Although the discrete-time Fourier transform does not exist for a signal with infinite energy, if the signal has finite power, a generalized discrete-time Fourier transform exists and is largely used for deterministic signals [14].

2.2.2 Random Signals

A random variable X is a function that assigns a number to every outcome, denoted by ϱ, of a given experiment. A stochastic process is a rule to describe the time evolution of the random variable depending on ϱ; therefore, it is a function of two variables, X(k, ϱ). The set of all experimental outcomes, i.e., the ensemble, is the domain of ϱ. We denote x(k) as a sample of the given process with ϱ fixed; in this case, if k is also fixed, x(k) is a number. When any statistical operator is applied to x(k), it is implied that x(k) is a random variable, k is fixed, and ϱ is variable. In this book, x(k) represents a random signal.

Random signals do not have a precise description of their waveforms. What is possible is to characterize them via measured statistics or through a probabilistic model. For random signals, the first- and second-order statistics are most of the time sufficient for characterization of the stochastic process. The first- and second-order statistics are also convenient for measurements. In addition, the effect on these statistics caused by linear filtering can be easily accounted for, as shown below.

Let’s consider for the time being that the random signals are real. We start to introduce some tools to deal with random signals by defining the distribution function of a random variable as

$${P}_{x(k)}(y)\stackrel{\bigtriangleup }{=}\:\mathit{probability}\:\mathit{of }\:x(k)\:\mathit{being}\:\mathit{smaller}\:\mathit{or}\:\mathit{equal}\:\mathit{to}\:\mathit{y}$$

or

$${P}_{x(k)}(y) ={ \int \nolimits \nolimits }_{-\infty }^{y}{p}_{ x(k)}(z)dz$$
(2.6)

The derivative of the distribution function is the probability density function (pdf)

$${p}_{x(k)}(y) = \frac{d{P}_{x(k)}(y)} {dy}$$
(2.7)

The expected value, or mean value, of the process is defined by

$${m}_{x}(k) = E[x(k)]$$
(2.8)

The definition of the expected value is expressed as

$$E[x(k)] ={ \int \nolimits \nolimits }_{-\infty }^{\infty }y\:{p}_{ x(k)}(y)dy$$
(2.9)

where p x(k)(y) is the pdf of x(k) at the point y.

The autocorrelation function of the process x(k) is defined by

$${r}_{x}(k,l) = E[x(k)x(l)] ={ \int \nolimits \nolimits }_{-\infty }^{\infty }{\int \nolimits \nolimits }_{-\infty }^{\infty }yz{p}_{ x(k),x(l)}(y,z)dydz$$
(2.10)

where p x(k), x(l)(y, z) is the joint probability density of the random variables x(k) and x(l) defined as

$${p}_{x(k),x(l)}(y,z) = \frac{{\partial }^{2}{P}_{x(k),x(l)}(y,z)} {\partial y\partial z}$$
(2.11)

where

$${P}_{x(k),x(l)}(y,z)\stackrel{\bigtriangleup }{=}\mathit{probability}\:of\:\{x(k) \leq y\:and\:x(l) \leq z\}$$

The autocovariance function is defined as

$${\sigma }_{x}^{2}(k,l) = E\{[x(k) - {m}_{ x}(k)][x(l) - {m}_{x}(l)]\} = {r}_{x}(k,l) - {m}_{x}(k){m}_{x}(l)$$
(2.12)

where the second equality follows from the definitions of mean value and autocorrelation. For k = l, \({\sigma }_{x}^{2}(k,l) = {\sigma }_{x}^{2}(k)\), which is the variance of x(k).

The most important specific example of probability density function is the Gaussian density function, also known as the normal density function [15, 16]. The Gaussian pdf is defined by

$${p}_{x(k)}(y) = \frac{1} {\sqrt{2\pi {\sigma }_{x }^{2 }(k)}}{\mathrm{e}}^{-\frac{{(y-{m}_{x}(k))}^{2}} {2{\sigma }_{x}^{2}(k)} }$$
(2.13)

where m x (k) and σ x 2(k) are the mean and variance of x(k), respectively.

One justification for the importance of the Gaussian distribution is the central limit theorem. Given a random variable x composed of the sum of n independent random variables x i as follows:

$$x = \sum\limits_{i=1}^{n}{x}_{ i}$$
(2.14)

the central limit theorem states that under certain general conditions, the probability density function of x approaches a Gaussian density function for large n. The mean and variance of x are given, respectively, by

$${m}_{x} = \sum\limits_{i=1}^{n}{m}_{{ x}_{i}}$$
(2.15)
$${\sigma }_{x}^{2} = \sum\limits_{i=1}^{n}{\sigma }_{{ x}_{i}}^{2}$$
(2.16)

Considering that the values of the mean and variance of x can grow, define

$${x}^{\prime} = \frac{x - {m}_{x}} {{\sigma }_{x}}$$
(2.17)

In this case, for n → ∞ it follows that

$${p}_{{x}^{\prime}}(y) = \frac{1} {\sqrt{2\pi }}{\mathrm{e}}^{-\frac{{y}^{2}} {2} }$$
(2.18)
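A minimal Monte Carlo sketch of this statement, assuming i.i.d. uniform random variables (the values of n and the number of trials below are illustrative): summing n of them and normalizing as in (2.17) produces samples whose statistics approach those of the standard Gaussian in (2.18).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000

# x is a sum of n i.i.d. uniform variables (mean 1/2 and variance 1/12 each)
x = rng.random((trials, n)).sum(axis=1)

# Normalize as in (2.17): x' = (x - m_x) / sigma_x
x_prime = (x - n * 0.5) / np.sqrt(n / 12.0)

# The statistics of x' approach those of the zero-mean, unit-variance Gaussian (2.18)
print(x_prime.mean(), x_prime.var())     # close to 0 and 1
print(np.mean(np.abs(x_prime) < 1.0))    # close to 0.683, as for a standard Gaussian
```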

In a number of situations we require the calculation of conditional distributions, where the probability of a certain event occurring is calculated assuming that another event B has occurred. In this case, we define

$$\begin{array}{rcl}{ P}_{x(k)}(y\vert B)& = & \frac{P(\{x(k) \leq y\} \cap B)} {P(B)} \\ & \stackrel{\bigtriangleup }{=}& \:\mathit{probability}\:of\:x(k) \leq y\:\mathit{assuming}\:B\:\mathit{has}\:\mathit{occurred}\quad \end{array}$$
(2.19)

This joint event consists of all outcomes ϱ ∈ B such that x(k) = x(k, ϱ) ≤ y. The definition of the conditional mean is given by

$${m}_{x\vert B}(k) = E[x(k)\vert B] ={ \int \nolimits \nolimits }_{-\infty }^{\infty }y{p}_{ x(k)}(y\vert B)dy$$
(2.20)

where p x(k)(y | B) is the pdf of x(k) conditioned on B.

The conditional variance is defined as

$${\sigma }_{x\vert B}^{2}(k) = E\{{[x(k) - {m}_{ x\vert B}(k)]}^{2}\vert B\} ={ \int \nolimits \nolimits }_{-\infty }^{\infty }{[y - {m}_{ x\vert B}(k)]}^{2}{p}_{ x(k)}(y\vert B)dy$$
(2.21)

There are processes for which the mean and autocorrelation functions are shift (or time) invariant, i.e.,

$$\begin{array}{rcl} & {m}_{x}(k - i) = {m}_{x}(k) = E[x(k)] = {m}_{x}&\end{array}$$
(2.22)
$$\begin{array}{rcl} & {r}_{x}(k,i) = E[x(k - j)x(i - j)] = {r}_{x}(k - i) = {r}_{x}(l)&\end{array}$$
(2.23)

and as a consequence

$${\sigma }_{x}^{2}(l) = {r}_{ x}(l) - {m}_{x}^{2}$$
(2.24)

These processes are said to be wide-sense stationary (WSS). If the nth-order statistics of a process are shift invariant, the process is said to be nth-order stationary. Also, if the process is nth-order stationary for any value of n, the process is said to be strict-sense stationary.

Two processes are considered jointly WSS if and only if any linear combination of them is also WSS. This is equivalent to stating that

$$y(k) = {k}_{1}\:{x}_{1}(k) + {k}_{2}\:{x}_{2}(k)$$
(2.25)

must be WSS, for any constants k 1 and k 2, if x 1(k) and x 2(k) are jointly WSS. This property implies that both x 1(k) and x 2(k) have shift-invariant means and autocorrelations, and that their cross-correlation is also shift invariant.

For complex signals where \(x(k) = {x}_{r}(k) + j{x}_{i}(k)\), \(y = {y}_{r} + j{y}_{i}\), and \(z = {z}_{r} + j{z}_{i}\), we have the following definition of the expected value

$$E[x(k)] ={ \int \nolimits \nolimits }_{-\infty }^{\infty }{\int \nolimits \nolimits }_{-\infty }^{\infty }y{p}_{{ x}_{r}(k),{x}_{i}(k)}({y}_{r},{y}_{i})d{y}_{r}d{y}_{i}$$
(2.26)

where \({p}_{{x}_{r}(k),{x}_{i}(k)}({y}_{r},{y}_{i})\) is the joint probability density function (pdf) of x r (k) and x i (k).

The autocorrelation function of the complex random signal x(k) is defined by

$$\begin{array}{rcl}{ r}_{x}(k,l)& =& E[x(k){x}^{{_\ast}}(l)] \\ & =& {\int \nolimits \nolimits }_{-\infty }^{\infty }{\int \nolimits \nolimits }_{-\infty }^{\infty }{\int \nolimits \nolimits }_{-\infty }^{\infty }{\int \nolimits \nolimits }_{-\infty }^{\infty }y{z}^{{_\ast}}{p}_{{ x}_{r}(k),{x}_{i}(k),{x}_{r}(l),{x}_{i}(l)}({y}_{r},{y}_{i},{z}_{r},{z}_{i})d{y}_{r}d{y}_{i}d{z}_{r}d{z}_{i} \\ & & \end{array}$$
(2.27)

where ∗ denotes complex conjugate, since we assume for now that we are dealing with complex signals, and \({p}_{{x}_{r}(k),{x}_{i}(k),{x}_{r}(l),{x}_{i}(l)}({y}_{r},{y}_{i},{z}_{r},{z}_{i})\) is the joint probability density function of the random variables x(k) and x(l).

For complex signals the autocovariance function is defined as

$$\begin{array}{rcl}{ \sigma }_{x}^{2}(k,l) = E\{[x(k) - {m}_{ x}(k)]{[x(l) - {m}_{x}(l)]}^{{_\ast}}\} = {r}_{ x}(k,l) - {m}_{x}(k){m}_{x}^{{_\ast}}(l)\qquad & &\end{array}$$
(2.28)

2.2.2.1 Autoregressive Moving Average Process

The process resulting from the output of a system described by a general linear difference equation given by

$$y(k) = \sum\limits_{j=0}^{M}{b}_{ j}x(k - j) + \sum\limits_{i=1}^{N}{a}_{ i}y(k - i)$$
(2.29)

where x(k) is a white noise, is called an autoregressive moving average (ARMA) process. The coefficients a i and b j are the parameters of the ARMA process. The output signal y(k) is also said to be a colored noise since the autocorrelation function of y(k) is nonzero for lags different from zero, i.e., r(l)≠0 for some l≠0.

For the special case where b j  = 0 for j = 1, 2, …, M, the resulting process is called an autoregressive (AR) process. The terminology means that the process depends on the present value of the input signal and on a linear combination of past samples of the process. This indicates the presence of a feedback of the output signal.

For the special case where a i  = 0 for i = 1, 2, …, N, the process is identified as a moving average (MA) process. This terminology indicates that the process depends on a linear combination of the present and past samples of the input signal. In summary, an ARMA process can be generated by applying a white noise to the input of a digital filter with poles and zeros, whereas for the AR and MA cases the digital filters are all-pole and all-zero filters, respectively.
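A small sketch of how such processes can be generated with SciPy's lfilter; the coefficient values are illustrative assumptions. Note that lfilter's denominator convention is [1, −a_1, …, −a_N] relative to the signs used in (2.29).

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(1)
x = rng.standard_normal(50_000)                  # white-noise input x(k)

# Illustrative ARMA(2, 2) coefficients in the notation of (2.29)
b = [1.0, 0.5, 0.2]                              # b_0, b_1, b_2 (zeros / MA part)
a = [0.6, -0.1]                                  # a_1, a_2     (poles / AR part)

# SciPy's recursion is den[0] y(k) = sum_j b_j x(k-j) - sum_{i>=1} den[i] y(k-i),
# so the denominator corresponding to (2.29) is [1, -a_1, ..., -a_N].
den = np.concatenate(([1.0], -np.asarray(a)))

y_arma = lfilter(b, den, x)                      # pole-zero filter: ARMA (colored noise)
y_ar = lfilter([b[0]], den, x)                   # all-pole filter: AR process
y_ma = lfilter(b, [1.0], x)                      # all-zero filter: MA process
```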

2.2.2.2 Markov Process

A stochastic process is called a Markov process if its past has no influence on the future when the present is specified [15, 14]. In other words, the present behavior of the process depends only on the most recent past, i.e., all behavior previous to the most recent past is not required. A first-order AR process is a first-order Markov process, whereas an Nth-order AR process is considered an Nth-order Markov process. Take as an example the sequence

$$\begin{array}{rcl} y(k) = ay(k - 1) + n(k)& &\end{array}$$
(2.30)

where n(k) is a white-noise process. The process represented by y(k) is determined by y(k − 1) and n(k), and no information before instant k − 1 is required. We conclude that y(k) represents a Markov process. In the previous example, if a = 1 and \(y(-1) = 0\), the signal y(k), for k ≥ 0, is a sum of white noise samples, usually called a random walk sequence.

Formally, an mth-order Markov process satisfies the following condition: for all k ≥ 0, and for a fixed m, it follows that

$$\begin{array}{rcl} & & {P}_{x(k)}\left (y\vert x(k - 1),x(k - 2),\ldots,x(0)\right ) \\ & & \quad = {P}_{x(k)}\left (y\vert x(k - 1),x(k - 2),\ldots,x(k - m)\right )\end{array}$$
(2.31)

2.2.2.3 Wold Decomposition

Another important result related to any WSS process x(k) is the Wold decomposition, which states that x(k) can be decomposed as

$$x(k) = {x}_{r}(k) + {x}_{p}(k)$$
(2.32)

where x r (k) is a regular process that is equivalent to the response of a stable, linear, time-invariant, and causal filter to a white noise [14], and x p (k) is a perfectly predictable (deterministic or singular) process. Also, x p (k) and x r (k) are orthogonal processes, i.e., E[x r (k)x p (k)] = 0. The key factor here is that the regular process can be modeled through a stable autoregressive model [17] with a stable and causal inverse. The importance of the Wold decomposition lies in the observation that a WSS process can in part be represented by an AR process of adequate order, with the remaining part consisting of a perfectly predictable process. Obviously, the perfectly predictable part of x(k) also admits an AR model with zero excitation.

2.2.2.4 Power Spectral Density

Stochastic signals that are WSS are persistent and therefore are not finite-energy signals. On the other hand, they have finite power, so the generalized discrete-time Fourier transform can be applied to them. When the generalized discrete-time Fourier transform is applied to a WSS process, it leads to a random function of the frequency [14]. On the other hand, the autocorrelation functions of most practical stationary processes have a discrete-time Fourier transform. Therefore, the discrete-time Fourier transform of the autocorrelation function of a stationary random process can be very useful in many situations. This transform, called the power spectral density, is defined as

$${R}_{x}({\mathrm{e}}^{j\omega }) = \sum\limits_{l=-\infty }^{\infty }{r}_{ x}(l){\mathrm{e}}^{-j\omega l} = \mathcal{F}[{r}_{ x}(l)]$$
(2.33)

where r x (l) is the autocorrelation of the process represented by x(k). The inverse discrete-time Fourier transform allows us to recover r x (l) from R x (ejω), through the relation

$${r}_{x}(l) = \frac{1} {2\pi }{\int \nolimits \nolimits }_{-\pi }^{\pi }{R}_{ x}({\mathrm{e}}^{j\omega }){\mathrm{e}}^{j\omega l}d\omega = {\mathcal{F}}^{-1}[{R}_{ x}({\mathrm{e}}^{j\omega })]$$
(2.34)

It should be mentioned that \({R}_{x}({\mathrm{e}}^{j\omega })\) is a deterministic function of ω and can be interpreted as the power density of the random process at a given frequency in the ensemble, i.e., considering the average outcome of all possible realizations of the process. In particular, the mean squared value of the process represented by x(k) is given by

$${r}_{x}(0) = \frac{1} {2\pi }{\int \nolimits \nolimits }_{-\pi }^{\pi }{R}_{ x}({\mathrm{e}}^{j\omega })d\omega $$
(2.35)

If the random signal representing any single realization of a stationary process is applied as input to a linear and time-invariant filter, with impulse response h(k), the following equalities are valid and can be easily verified:

$$\begin{array}{rcl} y(k)& =& \sum\limits_{n=-\infty }^{\infty }x(n)h(k - n) = x(k) {_\ast} h(k)\end{array}$$
(2.36)
$$\begin{array}{rcl}{ r}_{y}(l)& =& {r}_{x}(l) {_\ast} {r}_{h}(l)\end{array}$$
(2.37)
$$\begin{array}{rcl}{ R}_{y}({\mathrm{e}}^{j\omega })& =& {R}_{ x}({\mathrm{e}}^{j\omega })\vert H({\mathrm{e}}^{j\omega }){\vert }^{2}\end{array}$$
(2.38)
$$\begin{array}{rcl}{ r}_{yx}(l)& =& {r}_{x}(l) {_\ast} h(l) = E[{x}^{{_\ast}}(k)y(k + l)]\end{array}$$
(2.39)
$$\begin{array}{rcl}{ R}_{yx}({\mathrm{e}}^{j\omega })& =& {R}_{ x}({\mathrm{e}}^{j\omega })H({\mathrm{e}}^{j\omega })\end{array}$$
(2.40)

where \({r}_{h}(l) = h(l) {_\ast} h(-l)\), R y (ejω) is the power spectral density of the output signal, r yx (k) is the cross-correlation of x(k) and y(k), and R yx (ejω) is the corresponding cross-power spectral density.

The main feature of the spectral density function is to allow a simple analysis of the correlation behavior of WSS random signals processed with linear time-invariant systems. As an illustration, suppose a white noise is applied as input to a lowpass filter with impulse response h(k) and sharp cutoff at a given frequency ω l . The autocorrelation function of the output signal y(k) will not be a single impulse; it will instead be h(k) ∗ h( − k). Therefore, the signal y(k) will look like a band-limited random signal, in this case, a slowly varying noise. Some properties of the function \({R}_{x}({\mathrm{e}}^{j\omega })\) of a discrete-time and stationary stochastic process are worth mentioning. The power spectral density is a periodic function of ω, with period 2π, as can be verified from its definition. Also, since for a stationary and complex random process we have \({r}_{x}(-l) = {r}_{x}^{{_\ast}}(l)\), \({R}_{x}({\mathrm{e}}^{j\omega })\) is real. Despite the usefulness of the power spectral density function in dealing with WSS processes, it will not be widely used in this book since the filters considered here are usually time varying. However, its important role in areas such as spectrum estimation [18, 19] should be noted.
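The sketch below illustrates (2.36) and (2.37) for the white-noise illustration above: with a white input of variance σ², r_y(l) = σ² r_h(l), and by (2.38) the output power spectral density is σ²|H(e^{jω})|². The short 3-tap lowpass h(k) is an illustrative assumption, and ergodicity is assumed when replacing ensemble averages by time averages.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, N = 1.0, 100_000
x = np.sqrt(sigma2) * rng.standard_normal(N)     # white noise: r_x(l) = sigma2 * delta(l)

h = np.array([0.25, 0.5, 0.25])                  # illustrative short lowpass FIR
y = np.convolve(x, h)[:N]                        # y(k) = x(k) * h(k), as in (2.36)

# From (2.37) with white input: r_y(l) = sigma2 * r_h(l), where r_h(l) = h(l) * h(-l)
r_h = np.convolve(h, h[::-1])                    # lags -2, ..., 2
r_y_theory = sigma2 * r_h[len(h) - 1:]           # keep the nonnegative lags 0, 1, 2

# Time-average estimates of r_y(l); by (2.38), R_y(e^{jw}) = sigma2 |H(e^{jw})|^2
r_y_hat = np.array([np.mean(y[: N - l] * y[l:]) for l in range(len(h))])
print(r_y_theory)    # [0.375, 0.25, 0.0625]
print(r_y_hat)       # close to the theoretical values
```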

If the \(\mathcal{Z}\)-transforms of the autocorrelation and cross-correlation functions exist, we can generalize the definition of power spectral density. In particular, the definition of (2.33) corresponds to the following relation

$$\begin{array}{rcl} \mathcal{Z}[{r}_{x}(k)] = {R}_{x}(z) = \sum\limits_{k=-\infty }^{\infty }{r}_{ x}(k){z}^{-k}& &\end{array}$$
(2.41)

If the random signal representing any single realization of a stationary process is applied as input to a linear and time-invariant filter with impulse response h(k), the following equalities are valid:

$$\begin{array}{rcl}{ R}_{y}(z) = {R}_{x}(z)H(z)H({z}^{-1})& &\end{array}$$
(2.42)

and

$${R}_{yx}(z) = {R}_{x}(z)H(z)$$
(2.43)

where \(H(z) = \mathcal{Z}[h(l)]\). If we wish to calculate the cross-correlation of y(k) and x(k), namely r yx (0), we can use the inverse \(\mathcal{Z}\)-transform formula as follows:

$$\begin{array}{rcl} E[y(k){x}^{{_\ast}}(k)]& =& \frac{1} {2\pi j}\oint \nolimits {R}_{yx}(z)\frac{dz} {z} \\ & =& \frac{1} {2\pi j}\oint \nolimits H(z){R}_{x}(z)\frac{dz} {z}\end{array}$$
(2.44)

where the integration path is a counterclockwise closed contour in the region of convergence of R yx (z). The contour integral in the above equation is usually solved through Cauchy’s residue theorem [8].

2.2.3 Ergodicity

In the probabilistic approach, the statistical parameters of the real data are obtained through ensemble averages (or expected values). The estimation of any parameter of the stochastic process can be obtained by averaging a large number of realizations of the given process, at each instant of time. However, in many applications only a few or even a single sample of the process is available. In these situations, we need to find out in which cases the statistical parameters of the process can be estimated by using the time average of a single sample (or ensemble member) of the process. This is obviously not possible if the desired parameter is time varying. The equivalence between the ensemble average and the time average is called ergodicity [15, 14].

The time average of a given stationary process represented by x(k) is calculated by

$$\hat{{m}}_{{x}_{N}} = \frac{1} {2N + 1}\sum\limits_{k=-N}^{N}x(k)$$
(2.45)

If

$${\sigma }_{\hat{{m}}_{{x}_{ N}}}^{2} =\lim\limits_{ N\rightarrow \infty }E\{\vert \hat{{m}}_{{x}_{N}} - {m}_{x}{\vert }^{2}\} = 0$$

the process is said to be mean-ergodic in the mean-square sense. Therefore, a mean-ergodic process has a time average that approximates the ensemble average as N → ∞. Obviously, \(\hat{{m}}_{{x}_{N}}\) is an unbiased estimate of \({m}_{x}\) since

$$E[\hat{{m}}_{{x}_{N}}] = \frac{1} {2N + 1}\sum\limits_{k=-N}^{N}E[x(k)] = {m}_{ x}$$
(2.46)

Therefore, the process will be considered ergodic if the variance of \(\hat{{m}}_{{x}_{N}}\) tends to zero (\({\sigma }_{\hat{{m}}_{{x}_{ N}}}^{2} \rightarrow 0\)) when N → ∞. The variance \({\sigma }_{\hat{{m}}_{{x}_{ N}}}^{2}\) can be expressed after some manipulations as

$${\sigma }_{\hat{{m}}_{{x}_{ N}}}^{2} = \frac{1} {2N + 1}\sum\limits_{l=-2N}^{2N}{\sigma }_{ x}^{2}(k + l,k)\left (1 - \frac{\vert l\vert } {2N + 1}\right )$$
(2.47)

where σ x 2(k + l, k) is the autocovariance of the stochastic process x(k). The variance of \(\hat{{m}}_{{x}_{N}}\) tends to zero if and only if

$$\lim\limits_{N\rightarrow \infty } \frac{1} {N}\sum\limits_{l=0}^{N}{\sigma }_{ x}^{2}(k + l,k) \rightarrow 0$$

The above condition is necessary and sufficient to guarantee that the process is mean-ergodic.

The ergodicity concept can be extended to higher order statistics. In particular, for second-order statistics we can define the process

$$\begin{array}{rcl}{ x}_{l}(k) = x(k + l){x}^{{_\ast}}(k)& &\end{array}$$
(2.48)

where the mean of this process corresponds to the autocorrelation of x(k), i.e., r x (l). Mean-ergodicity of x l (k) implies mean-square ergodicity of the autocorrelation of x(k).

The time average of x l (k) is given by

$$\hat{{m}}_{{x}_{l,N}} = \frac{1} {2N + 1}\sum\limits_{k=-N}^{N}{x}_{ l}(k)$$
(2.49)

that is an unbiased estimate of r x (l). If the variance of \(\hat{{m}}_{{x}_{l,N}}\) tends to zero as N tends to infinity, the process x(k) is said to be mean-square ergodic of the autocorrelation, i.e.,

$$ \begin{array}{rcl} \lim\limits_{N\rightarrow \infty }E\{\vert \hat{{m}}_{{x}_{l,N}} - {r}_{x}(l){\vert }^{2}\} = 0& &\end{array}$$
(2.50)

The above condition is satisfied if and only if

$$ \begin{array}{rcl} \lim\limits_{N\rightarrow \infty } \frac{1} {N}\sum\limits_{i=0}^{N}E\{x(k + l){x}^{{_\ast}}(k)x(k + l + i){x}^{{_\ast}}(k + i)\} - {r}_{ x}^{2}(l) = 0& &\end{array}$$
(2.51)

where it is assumed that x(n) has stationary fourth-order moments. The concept of ergodicity can be extended to nonstationary processes [14]; however, that is beyond the scope of this book.
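A short simulation of the mean-ergodicity statement, assuming a zero-mean AR(1) process with illustrative parameter values and a one-sided averaging window for simplicity: the variance of the time average, measured across independent realizations, shrinks as the window length grows, consistent with the variance expression (2.47) tending to zero.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(3)
a1, sigma_n, trials = 0.9, 1.0, 2_000        # illustrative AR(1) parameter and noise power

def time_average_means(N):
    """One time-average mean estimate per independent realization of length N."""
    n = sigma_n * rng.standard_normal((trials, N))
    x = lfilter([1.0], [1.0, -a1], n, axis=1)    # zero-mean AR(1): x(k) = a1 x(k-1) + n(k)
    return x.mean(axis=1)

# The true mean is zero; the variance of the time-average estimate shrinks as
# the window grows, which is the mean-ergodicity condition seen empirically.
for N in (100, 1_000, 10_000):
    print(N, time_average_means(N).var())
```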

2.3 The Correlation Matrix

Usually, adaptive filters utilize the available input signals at instant k in their updating equations. These inputs are the elements of the input signal vector denoted by

$$\begin{array}{rcl} \bf{x}(k) = {[{x}_{0}(k)\:{x}_{1}(k)\ldots {x}_{N}(k)]}^{T}& & \\ \end{array}$$

The correlation matrix is defined as R = E[x(k)x H(k)], where x H(k) is the Hermitian transpose of x(k), which means transposition followed by complex conjugation or vice versa. As will be noted, the characteristics of the correlation matrix play a key role in the understanding of the properties of most adaptive-filtering algorithms. As a consequence, it is important to examine the main properties of the matrix R. Some properties of the correlation matrix come from the statistical nature of the adaptive-filtering problem, whereas other properties derive from linear algebra theory.

For a given input vector, the correlation matrix is given by

$$\begin{array}{rcl} \bf{R}& =& \left [\begin{array}{cccc} E[\vert {x}_{0}(k){\vert }^{2}] & E[{x}_{0}(k){x}_{1}^{{_\ast}}(k)] &\cdots &E[{x}_{0}(k){x}_{N}^{{_\ast}}(k)] \\ E[{x}_{1}(k){x}_{0}^{{_\ast}}(k)] & E[\vert {x}_{1}(k){\vert }^{2}] &\cdots &E[{x}_{1}(k){x}_{N}^{{_\ast}}(k)]\\ \vdots & \vdots & \ddots & \vdots \\ E[{x}_{N}(k){x}_{0}^{{_\ast}}(k)]&E[{x}_{N}(k){x}_{1}^{{_\ast}}(k)]&\cdots & E[\vert {x}_{N}(k){\vert }^{2}]\\ \end{array} \right ] \\ & =& E[\bf{x}(k){\bf{x}}^{H}(k)] \end{array}$$
(2.52)

The main properties of the R matrix are listed below:

  1. 1.

    The matrix R is positive semidefinite.

    Proof.

    Given an arbitrary complex weight vector w, we can form a signal given by

    $$y(k) ={ \bf{w}}^{H}\bf{x}(k)$$

    The magnitude squared of y(k) is

    $$y(k){y}^{{_\ast}}(k) = \vert y(k){\vert }^{2} ={ \bf{w}}^{H}\bf{x}(k){\bf{x}}^{H}(k)\bf{w} \geq 0$$

    The mean-square (MS) value of y(k) is then given by

    $$\mathrm{MS}[y(k)] = E[\vert y(k){\vert }^{2}] ={ \bf{w}}^{H}E[\bf{x}(k){\bf{x}}^{H}(k)]\bf{w} ={ \bf{w}}^{H}\bf{R}\bf{w} \geq 0$$

    Therefore, the matrix R is positive semidefinite.

    Usually, the matrix R is positive definite, unless the signals that compose the input vector are linearly dependent. Linearly dependent signals are rarely found in practice.

  2. 2.

    The matrix R is Hermitian, i.e.,

    $$\bf{R} ={ \bf{R}}^{H}$$
    (2.53)

    Proof.

    $$\begin{array}{rcl}{ \bf{R}}^{H} = E\{{[\bf{x}(k){\bf{x}}^{H}(k)]}^{H}\} = E[\bf{x}(k){\bf{x}}^{H}(k)] = \bf{R}& & \\ \end{array}$$
  3. 3.

    A matrix is Toeplitz if the elements of the main diagonal and of any secondary diagonal are equal. When the input signal vector is composed of delayed versions of the same signal (i.e., \({x}_{i}(k) = {x}_{0}(k - i)\), for i = 1, 2, …, N) taken from a WSS process, matrix R is Toeplitz.

    Proof.

    For the delayed signal input vector, with x(k) WSS, matrix R has the following form

    $$\begin{array}{rcl} \bf{R}& =& \left [\begin{array}{cccc} {r}_{x}(0) & {r}_{x}(1) &\cdots & {r}_{x}(N) \\ {r}_{x}(-1) & {r}_{x}(0) &\cdots &{r}_{x}(N - 1)\\ \vdots & \vdots & \ddots & \vdots \\ {r}_{x}(-N)&{r}_{x}(-N + 1)&\cdots & {r}_{x}(0)\\ \end{array} \right ] \end{array}$$
    (2.54)

    By examining the right-hand side of the above equation, we can easily conclude that R is Toeplitz.

Note that \({r}_{x}^{{_\ast}}(i) = {r}_{x}(-i)\), which also follows from the fact that the matrix R is Hermitian.

If matrix R given by (2.54) is nonsingular for a given N, the input signal is said to be persistently exciting of order N + 1. This means that the power spectral density \({R}_{x}({\mathrm{e}}^{j\omega })\) is different from zero at least at N + 1 points in the interval 0 < ω ≤ 2π. It also means that a nontrivial Nth-order FIR filter (with at least one nonzero coefficient) cannot filter x(k) to zero. Note that a nontrivial filter, with x(k) as input, would require at least N + 1 zeros in order to generate an output with all samples equal to zero. The absence of persistence of excitation implies the misbehavior of some adaptive algorithms [20, 21]. The definition of persistence of excitation is not unique, and it is algorithm dependent (see the book by Johnson [20] for further details).
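The sketch below builds the Toeplitz matrix of (2.54) for a random-phase sinusoid, whose autocorrelation is r_x(l) = (a²/2)cos(ω₀l), and shows that R has rank 2 for N > 1: a single sinusoid is persistently exciting only of order 2, since a second-order FIR filter with zeros at e^{±jω₀} drives it to zero. The numerical values are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import toeplitz

# Random-phase sinusoid: r_x(l) = (a^2 / 2) cos(w0 l)   (illustrative a and w0)
a, w0, N = 1.0, 0.3 * np.pi, 3
r = 0.5 * a**2 * np.cos(w0 * np.arange(N + 1))

R = toeplitz(r)                              # Toeplitz correlation matrix of (2.54)
print(np.linalg.matrix_rank(R))              # 2, so this 4x4 matrix is singular
# The sinusoid is persistently exciting only of order 2: the 3-tap FIR filter
# with zeros at e^{+-j w0} filters x(k) to zero, hence R of size > 2 is singular.

# White noise of variance sigma2, in contrast, gives R = sigma2 * I, which is
# nonsingular for any N (persistently exciting of any order).
```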

From now on in this section, we discuss some properties of the correlation matrix related to its eigenvalues and eigenvectors. A number λ is an eigenvalue of the matrix R, with a corresponding eigenvector q, if and only if

$$\bf{R}\bf{q} = \lambda \bf{q}$$
(2.55)

or equivalently

$$\mathrm{det}(\bf{R} - \lambda \bf{I}) = 0$$
(2.56)

where I is the (N + 1) by (N + 1) identity matrix. Equation (2.56) is called the characteristic equation of R and has (N + 1) solutions for λ. We denote the (N + 1) eigenvalues of R by \({\lambda }_{0},{\lambda }_{1},\ldots,{\lambda }_{N}\). Note also that for every value of λ, the vector q = 0 satisfies (2.55); however, we consider only those particular values of λ that are linked to a nonzero eigenvector q.

Some important properties related to the eigenvalues and eigenvectors of R, which will be useful in the following chapters, are listed below.

  1. 1.

    The eigenvalues of \({\bf{R}}^{m}\) are \({\lambda }_{i}^{m}\), for i = 0, 1, 2, …, N.

    Proof.

    By premultiplying (2.55) by R m − 1, we obtain

    $$\begin{array}{rcl}{ \bf{R}}^{m-1}\bf{R}{\bf{q}}_{ i}& =&{ \bf{R}}^{m-1}{\lambda }_{ i}{\bf{q}}_{i} = {\lambda }_{i}{\bf{R}}^{m-2}\bf{R}{\bf{q}}_{ i} \\ & =& {\lambda }_{i}{\bf{R}}^{m-2}{\lambda }_{ i}{\bf{q}}_{i} = {\lambda }_{i}^{2}{\bf{R}}^{m-3}\bf{R}{\bf{q}}_{ i} \\ & =& \cdots = {\lambda }_{i}^{m}{\bf{q}}_{ i} \end{array}$$
    (2.57)
  2. 2.

    Suppose R has N + 1 linearly independent eigenvectors q i ; then if we form a matrix Q with columns consisting of the q i ’s, it follows that

    $${ \bf{Q}}^{-1}\bf{R}\bf{Q} = \left [\begin{array}{cccc} {\lambda }_{0} & 0 &\cdots & 0 \\ 0 &{\lambda }_{1} & & \vdots \\ \vdots & 0 &\cdots & \vdots\\ \vdots & \vdots & & 0 \\ 0 & 0 &\cdots &{\lambda }_{N}\\ \end{array} \right ] = {\Lambda }$$
    (2.58)

    Proof.

    $$\begin{array}{rcl} \bf{R}\bf{Q}& =& \bf{R}[{\bf{q}}_{0}\:{\bf{q}}_{1}\cdots {\bf{q}}_{N}] = [{\lambda }_{0}{\bf{q}}_{0}\:{\lambda }_{1}{\bf{q}}_{1}\cdots {\lambda }_{N}{\bf{q}}_{N}] \\ & =& \bf{Q}\left [\begin{array}{cccc} {\lambda }_{0} & 0 &\cdots & 0 \\ 0 &{\lambda }_{1} & & \vdots \\ \vdots & 0 &\cdots & \vdots\\ \vdots & \vdots & & 0 \\ 0 & 0 &\cdots &{\lambda }_{N}\\ \end{array} \right ] = \bf{Q}{\Lambda }\\ \end{array}$$

    Therefore, since Q is invertible because the q i ’s are linearly independent, we can show that

    $${ \bf{Q}}^{-1}\bf{R}\bf{Q} = {\Lambda } \square $$
  3. 3.

    The nonzero eigenvectors \({\bf{q}}_{0},{\bf{q}}_{1},\ldots,{\bf{q}}_{N}\) that correspond to different eigenvalues are linearly independent.

    Proof.

    Suppose we form a linear combination of the eigenvectors such that

    $${a}_{0}{\bf{q}}_{0} + {a}_{1}{\bf{q}}_{1} + \cdots + {a}_{N}{\bf{q}}_{N} = \bf{0}$$
    (2.59)

    By multiplying the above equation by R we have

    $$\begin{array}{rlrlrl} {a}_{0}\bf{R}{\bf{q}}_{0} + {a}_{1}\bf{R}{\bf{q}}_{1} + \cdots + {a}_{N}\bf{R}{\bf{q}}_{N} = {a}_{0}{\lambda }_{0}{\bf{q}}_{0} + {a}_{1}{\lambda }_{1}{\bf{q}}_{1} + \cdots + {a}_{N}{\lambda }_{N}{\bf{q}}_{N} = \bf{0} & & \\ & & \end{array}$$
    (2.60)

    Now by multiplying (2.59) by λ N and subtracting the result from (2.60), we obtain

    $$\begin{array}{rcl}{ a}_{0}({\lambda }_{0} - {\lambda }_{N}){\bf{q}}_{0} + {a}_{1}({\lambda }_{1} - {\lambda }_{N}){\bf{q}}_{1} + \cdots + {a}_{N-1}({\lambda }_{N-1} - {\lambda }_{N}){\bf{q}}_{N-1} = \bf{0}& & \\ \end{array}$$

    By repeating the above steps, i.e., multiplying the above equation by R in one instance and by \({\lambda }_{N-1}\) in the other, and subtracting the results, it yields

    $$\begin{array}{rcl} & & {a}_{0}({\lambda }_{0} - {\lambda }_{N})({\lambda }_{0} - {\lambda }_{N-1}){\bf{q}}_{0} + {a}_{1}({\lambda }_{1} - {\lambda }_{N})({\lambda }_{1} - {\lambda }_{N-1}){\bf{q}}_{1} \\ & & \quad + \cdots + {a}_{N-2}({\lambda }_{N-2} - {\lambda }_{N})({\lambda }_{N-2} - {\lambda }_{N-1}){\bf{q}}_{N-2} = \bf{0} \\ \end{array}$$

    By repeating the same above steps several times, we end up with

    $${a}_{0}({\lambda }_{0} - {\lambda }_{N})({\lambda }_{0} - {\lambda }_{N-1})\cdots ({\lambda }_{0} - {\lambda }_{1}){\bf{q}}_{0} = \bf{0}$$

    Since we assumed \({\lambda }_{0}\neq {\lambda }_{1},{\lambda }_{0}\neq {\lambda }_{2},\ldots,{\lambda }_{0}\neq {\lambda }_{N}\), and \({\bf{q}}_{0}\) was assumed nonzero, then \({a}_{0} = 0\).

    The same line of thought can be used to show that \({a}_{0} = {a}_{1} = {a}_{2} = \cdots = {a}_{N} = 0\) is the only solution for (2.59). Therefore, the eigenvectors corresponding to different eigenvalues are linearly independent.

    Not all matrices are diagonalizable. A matrix of order (N + 1) is diagonalizable if it possesses (N + 1) linearly independent eigenvectors. A matrix with repeated eigenvalues may or may not be diagonalizable, depending on the linear dependency of its eigenvectors. A nondiagonalizable matrix is called defective [22].

  4. 4.

    Since the correlation matrix R is Hermitian, i.e., R H = R, its eigenvalues are real. These eigenvalues are equal to or greater than zero given that R is positive semidefinite.

    Proof.

    First note that given an arbitrary complex vector w,

    $$\begin{array}{rcl}{ ({\bf{w}}^{H}\bf{R}\bf{w})}^{H} ={ \bf{w}}^{H}{\bf{R}}^{H}{({\bf{w}}^{H})}^{H} ={ \bf{w}}^{H}\bf{R}\bf{w}& & \\ \end{array}$$

    Therefore, w H Rw is a real number. Assume now that λ i is an eigenvalue of R corresponding to the eigenvector q i , i.e., Rq i  = λ i q i . By premultiplying this equation by q i H, it follows that

    $${\bf{q}}_{i}^{H}\bf{R}{\bf{q}}_{ i} = {\lambda }_{i}{\bf{q}}_{i}^{H}{\bf{q}}_{ i} = {\lambda }_{i}\|{\bf{q}{}_{i}\|}^{2}$$

    where the operation \(\|{\mathbf{a}\|}^{2} = \vert {a}_{0}{\vert }^{2} + \vert {a}_{1}{\vert }^{2} + \cdots + \vert {a}_{N}{\vert }^{2}\) is the Euclidean norm squared of the vector a, which is always real. Since the term on the left-hand side is also real, \(\|{\bf{q}{}_{i}\|}^{2}\neq 0\), and R is positive semidefinite, we can conclude that λ i is real and nonnegative.

    Note that Q is not unique since each q i can be multiplied by an arbitrary nonzero constant, and the resulting vector continues to be an eigenvector. For practical reasons, we consider only normalized eigenvectors having length one, that is

    $${ \bf{q}}_{i}^{H}{\bf{q}}_{ i} = 1\:\:\:\:\:\mathrm{for}\:i = 0,1,\ldots,N$$
    (2.61)
  5. 5.

    If R is a Hermitian matrix with different eigenvalues, the eigenvectors are orthogonal to each other. As a consequence, there is a diagonalizing matrix Q that is unitary, i.e., Q H Q = I.

    Proof.

    Given two eigenvalues λ i and λ j , it follows that

    $$\bf{R}{\bf{q}}_{i} = {\lambda }_{i}{\bf{q}}_{i}$$

    and

    $$\bf{R}{\bf{q}}_{j} = {\lambda }_{j}{\bf{q}}_{j}$$
    (2.62)

    Using the fact that R is Hermitian and that λ i and λ j are real, then

    $${\bf{q}}_{i}^{H}\bf{R} = {\lambda }_{ i}{\bf{q}}_{i}^{H}$$

    and by multiplying this equation on the right by q j , we get

    $${\bf{q}}_{i}^{H}\bf{R}{\bf{q}}_{ j} = {\lambda }_{i}{\bf{q}}_{i}^{H}{\bf{q}}_{ j}$$

    Now by premultiplying (2.62) by q i H, it follows that

    $${\bf{q}}_{i}^{H}\bf{R}{\bf{q}}_{ j} = {\lambda }_{j}{\bf{q}}_{i}^{H}{\bf{q}}_{ j}$$

    Therefore,

    $${\lambda }_{i}{\bf{q}}_{i}^{H}{\bf{q}}_{ j} = {\lambda }_{j}{\bf{q}}_{i}^{H}{\bf{q}}_{ j}$$

    Since λ i ≠λ j , it can be concluded that

    $${\bf{q}}_{i}^{H}{\bf{q}}_{ j} = 0\:\:\:\:\:\mathrm{for}\:i\neq j$$

    If we form matrix Q with normalized eigenvectors, matrix Q is a unitary matrix.

    An important result is that any Hermitian matrix R can be diagonalized by a suitable unitary matrix Q, even if the eigenvalues of R are not distinct. The proof is omitted here and can be found in [22]. Therefore, for Hermitian matrices with repeated eigenvalues it is always possible to find a complete set of orthonormal eigenvectors.

    A useful form to decompose a Hermitian matrix that results from the last property is

    $$\bf{R} = \bf{Q}{\Lambda }{\bf{Q}}^{H} = \sum\limits_{i=0}^{N}{\lambda }_{ i}{\bf{q}}_{i}{\bf{q}}_{i}^{H}$$
    (2.63)

    that is known as the spectral decomposition. From this decomposition, one can easily derive the following relation

    $${ \bf{w}}^{H}\bf{R}\bf{w} = \sum\limits_{i=0}^{N}{\lambda }_{ i}{\bf{w}}^{H}{\bf{q}}_{ i}{\bf{q}}_{i}^{H}\bf{w} = \sum\limits_{i=0}^{N}{\lambda }_{ i}\vert {\bf{w}}^{H}{\bf{q}}_{ i}{\vert }^{2}$$
    (2.64)

    In addition, since \({\bf{q}}_{i} = {\lambda }_{i}{\bf{R}}^{-1}{\bf{q}}_{i}\), the eigenvectors of a matrix and of its inverse coincide, whereas the eigenvalues are reciprocals of each other. As a consequence,

    $${ \bf{R}}^{-1} = \sum\limits_{i=0}^{N} \frac{1} {{\lambda }_{i}}{\bf{q}}_{i}{\bf{q}}_{i}^{H}$$
    (2.65)

    Another consequence of the unitary property of Q for Hermitian matrices is that any Hermitian matrix can be written in the form

    $$\begin{array}{rcl} \bf{R}& =& \left [\sqrt{{\lambda }_{0}}{\bf{q}}_{0}\:\sqrt{{\lambda }_{1}}{\bf{q}}_{1}\ldots \sqrt{{\lambda }_{N}}{\bf{q}}_{N}\right ]\left [\begin{array}{c} \sqrt{{\lambda }_{0}}{\bf{q}}_{0}^{H} \\ \sqrt{{\lambda }_{1}}{\bf{q}}_{1}^{H}\\ \vdots \\ \sqrt{{\lambda }_{N}}{\bf{q}}_{N}^{H}\\ \end{array} \right ] \\ & =& \mathbf{L}{\mathbf{L}}^{H} \end{array}$$
    (2.66)
  6. 6.

    The sum of the eigenvalues of R is equal to the trace of R, and the product of the eigenvalues of R is equal to the determinant of R.

    Proof.

    $$\mathrm{tr}[{\bf{Q}}^{-1}\bf{R}\bf{Q}] = \mathrm{tr}[{\Lambda }]$$

    where \(\mathrm{tr}[\mathbf{A}] ={ \sum \nolimits }_{i=0}^{N}{a}_{ii}\). Since \(\mathrm{tr}[{\mathbf{A}}^{{\prime}}\mathbf{A}] = \mathrm{tr}[\mathbf{A}{\mathbf{A}}^{{\prime}}]\), we have

    $$\mathrm{tr}[{\bf{Q}}^{-1}\bf{R}\bf{Q}] = \mathrm{tr}[\bf{R}\bf{Q}{\bf{Q}}^{-1}] = \mathrm{tr}[\bf{R}\bf{I}] = \mathrm{tr}[\bf{R}] = \sum\limits_{i=0}^{N}{\lambda }_{ i}$$

    Also

    $$\mathrm{det}[{\bf{Q}}^{-1}\:\bf{R}\:\bf{Q}] = \mathrm{det}[\bf{R}]\:\mathrm{det}[\bf{Q}]\:\mathrm{det}[{\bf{Q}}^{-1}] = \mathrm{det}[\bf{R}] = \mathrm{det}[{\Lambda }] = \prod\limits_{i=0}^{N}{\lambda }_{ i}.$$
  7. 7.

    The Rayleigh quotient, defined as

    $$\mathcal{R} = \frac{{\bf{w}}^{H}\bf{R}\bf{w}} {{\bf{w}}^{H}\bf{w}}$$
    (2.67)

    of a Hermitian matrix is bounded by the minimum and maximum eigenvalues, i.e.,

    $${\lambda }_{\mathrm{min}} \leq \mathcal{R}\leq {\lambda }_{\mathrm{max}}$$
    (2.68)

    where the minimum and maximum values are reached when the vector w is chosen to be the eigenvector corresponding to the minimum and maximum eigenvalues, respectively.

    Proof.

    Suppose \(\bf{w} = \bf{Q}{\bf{w}}^{{\prime}}\), where Q is the matrix that diagonalizes R; then

    $$\begin{array}{rcl} \mathcal{R}& =& \frac{{{\bf{w}}^{{\prime}}}^{H}{\bf{Q}}^{H}\bf{R}\bf{Q}{\bf{w}}^{{\prime}}} {{{\bf{w}}^{{\prime}}}^{H}{\bf{Q}}^{H}\bf{Q}{\bf{w}}^{{\prime}}} \\ & =& \frac{{{\bf{w}}^{{\prime}}}^{H}{\Lambda }{\bf{w}}^{{\prime}}} {{{\bf{w}}^{{\prime}}}^{H}{\bf{w}}^{{\prime}}} \\ & =& \frac{{\sum \nolimits }_{i=0}^{N}{\lambda }_{i}{{w}_{i}^{{\prime}}}^{2}} {{\sum \nolimits }_{i=0}^{N}{{w}_{i}^{{\prime}}}^{2}} \end{array}$$
    (2.69)

    It is then possible to show, see Problem 14, that the minimum value of the above expression occurs when \({w}_{i}^{{\prime}} = 0\) for i ≠ j, where \({\lambda }_{j}\) is the smallest eigenvalue. Identically, the maximum value of \(\mathcal{R}\) occurs when \({w}_{i}^{{\prime}} = 0\) for i ≠ l, where \({\lambda }_{l}\) is the largest eigenvalue.

    There are several ways to define the norm of a matrix. In this book the norm of a matrix R, denoted by \(\|\bf{R}\|\), is defined by

    $$\begin{array}{rcl} \|{\bf{R}\|}^{2}& =& \max\limits_{\mathbf{ w}\neq 0}\frac{\|\bf{R}{\bf{w}\|}^{2}} {\|{\bf{w}\|}^{2}} \\ & =& \max\limits_{\bf{w}\neq 0}\frac{{\bf{w}}^{H}{\bf{R}}^{H}\bf{R}\bf{w}} {{\bf{w}}^{H}\bf{w}} \end{array}$$
    (2.70)

    Note that the norm of R is a measure of how a vector w grows in magnitude, when it is multiplied by R.

    When the matrix R is Hermitian, the norm of R is easily obtained by using the results of (2.57) and (2.68). The result is

    $$\begin{array}{rcl} \|\bf{R}\| = {\lambda }_{\mathrm{max}}& & \end{array}$$
    (2.71)

    where λmax is the maximum eigenvalue of R.

    A common problem that we encounter in adaptive filtering is the solution of a system of linear equations such as

    $$\bf{R}\bf{w} = \bf{p}$$
    (2.72)

    In case there is an error in the vector p, originated by quantization or estimation, how does it affect the solution of the system of linear equations? For a positive definite Hermitian matrix R, it can be shown [22] that the relative error in the solution of the above linear system of equations is bounded by

    $$\frac{\|\Delta \bf{w}\|} {\|\bf{w}\|} \leq \frac{{\lambda }_{\mathrm{max}}} {{\lambda }_{\mathrm{min}}} \frac{\|\Delta \bf{p}\|} {\|\bf{p}\|}$$
    (2.73)

    where λmax and λmin are the maximum and minimum values of the eigenvalues of R, respectively. The ratio λmax ∕ λmin is called condition number of a matrix, that is

    $$C = \frac{{\lambda }_{\mathrm{max}}} {{\lambda }_{\mathrm{min}}} =\| \bf{R}\|\|{\bf{R}}^{-1}\|$$
    (2.74)

    The value of C influences the convergence behavior of a number of adaptive-filtering algorithms, as will be seen in the following chapters. A large value of C indicates that the matrix R is ill-conditioned, and that errors introduced by the manipulation of R may be greatly amplified. When C = 1, the matrix is perfectly conditioned. In case R represents the correlation matrix of the input signal of an adaptive filter, with the input vector composed of uncorrelated elements of a delay line (see Fig. 2.1b, and the discussions around it), then C = 1.

Fig. 2.1
(a) Linear combiner; (b) Adaptive FIR filter
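A numerical sketch of several of the eigenvalue properties above, assuming an illustrative correlation matrix with entries r_x(l) = ρ^{|l|}: the spectral decomposition (2.63), the trace and determinant relations of property 6, the Rayleigh quotient bounds (2.68), and the condition number (2.74).

```python
import numpy as np
from scipy.linalg import toeplitz

rho, N = 0.8, 3
R = toeplitz(rho ** np.arange(N + 1))        # Hermitian (here real symmetric) Toeplitz matrix

lam, Q = np.linalg.eigh(R)                   # real, nonnegative eigenvalues; orthonormal Q
print(np.allclose(Q @ np.diag(lam) @ Q.T, R))            # spectral decomposition (2.63)
print(np.isclose(lam.sum(), np.trace(R)),                # sum of eigenvalues = trace
      np.isclose(lam.prod(), np.linalg.det(R)))          # product of eigenvalues = determinant

rng = np.random.default_rng(4)
w = rng.standard_normal(N + 1)
rayleigh = (w @ R @ w) / (w @ w)             # Rayleigh quotient (2.67)
print(lam[0] <= rayleigh <= lam[-1])         # bounded as in (2.68); eigh sorts ascending

print(lam[-1] / lam[0], np.linalg.cond(R))   # condition number (2.74), two ways
```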

Example 2.1.

Suppose the input signal vector is composed by a delay line with a single input signal, i.e.,

$$\bf{x}(k) = {[x(k)\:x(k - 1)\ldots x(k - N)]}^{T}$$

Given the following input signals:

  1. (a)
    $$x(k) = n(k)$$
  2. (b)
    $$x(k) = a\cos {\omega }_{0}k + n(k)$$
  3. (c)
    $$x(k) = \sum\limits_{i=0}^{M}{b}_{ i}n(k - i)$$
  4. (d)
    $$x(k) = -{a}_{1}x(k - 1) + n(k)$$
  5. (e)
    $$x(k) = a{\mathrm{e}}^{j({\omega }_{0}k+n(k))}$$

    where n(k) is a white noise with zero mean and variance σ n 2; in case (e) n(k) is uniformly distributed in the range − π to π.

Calculate the autocorrelation matrix R for N = 3.

Solution.

  1. (a)

    In this case, we have that \(E[x(k)x(k - l)] = {\sigma }_{n}^{2}\delta (l)\), where δ(l) denotes an impulse sequence. Therefore,

    $$\begin{array}{rcl} \bf{R}& =& E[\bf{x}(k){\bf{x}}^{T}(k)] = {\sigma }_{ n}^{2}\left [\begin{array}{cccc} 1&0&\cdots &0\\ 0 &1 &\cdots &0\\ \vdots & \vdots & \ddots & \vdots \\ 0&0&\cdots &1\\ \end{array} \right ] \end{array}$$
    (2.75)
  2. (b)

    In this example, n(k) is zero mean and uncorrelated with the deterministic cosine. The autocorrelation function can then be expressed as

    $$\begin{array}{rcl} r(k,k - l)& =& E[{a}^{2}\cos ({\omega }_{ 0}k)\cos ({\omega }_{0}k - {\omega }_{0}l) + n(k)n(k - l)] \\ & =& {a}^{2}E[\cos ({\omega }_{ 0}k)\cos ({\omega }_{0}k - {\omega }_{0}l)] + {\sigma }_{n}^{2}\delta (l) \\ & =& \frac{{a}^{2}} {2} [\cos ({\omega }_{0}l) +\cos (2{\omega }_{0}k - {\omega }_{0}l)] + {\sigma }_{n}^{2}\delta (l) \end{array}$$
    (2.76)

    where δ(l) again denotes an impulse sequence. Since part of the input signal is deterministic and nonstationary, the autocorrelation is time dependent.

    For the 3 ×3 case the input signal correlation matrix R(k) becomes

    $$\begin{array}{rcl} \frac{{a}^{2}} {2} \left [\begin{array}{ccc} 1 +\cos 2{\omega }_{0}k + \frac{2} {{a}^{2}} {\sigma }_{n}^{2} & \cos {\omega }_{0} +\cos {\omega }_{0}(2k - 1) & \cos 2{\omega }_{0} +\cos 2{\omega }_{0}(k - 1) \\ \cos {\omega }_{0} +\cos {\omega }_{0}(2k - 1) & 1 +\cos 2{\omega }_{0}(k - 1) + \frac{2} {{a}^{2}} {\sigma }_{n}^{2} & \cos {\omega }_{0} +\cos {\omega }_{0}(2(k - 1) - 1) \\ \cos 2{\omega }_{0} +\cos 2{\omega }_{0}(k - 1) & \cos {\omega }_{0} +\cos {\omega }_{0}(2(k - 1) - 1) & 1 +\cos 2{\omega }_{0}(k - 2) + \frac{2} {{a}^{2}} {\sigma }_{n}^{2}\ \ \ \ \ \end{array} \right ]& & \\ \end{array}$$
  3. (c)

    By exploring the fact that n(k) is a white noise, we can perform the following simplifications:

    $$\begin{array}{rcl} r(l)& =& E[x(k)x(k - l)] = E\left [\sum\limits_{j=0}^{M-l} \sum\limits_{i=0}^{M}{b}_{ i}{b}_{j}n(k - i)n(k - l - j)\right ] \\ & =& \sum\limits_{j=0}^{M-l}{b}_{ j}{b}_{l+j}E[{n}^{2}(k - l - j)] = {\sigma }_{ n}^{2} \sum\limits_{j=0}^{M}{b}_{ j}{b}_{l+j} \\ & & 0 \leq l + j \leq M \end{array}$$
    (2.77)

    where from the third to the fourth relation we used the fact that \(E[n(k - i)n(k - l - j)] = 0\) for i ≠ l + j. For M = 3, the correlation matrix has the following form

    $$\begin{array}{rcl} \bf{R}& =& {\sigma }_{n}^{2}\left [\begin{array}{cccc} \sum\limits_{i=0}^{3}{b}_{ i}^{2} & \sum\limits_{i=0}^{2}{b}_{ i}{b}_{i+1} & \sum\limits_{i=0}^{1}{b}_{ i}{b}_{i+2} & {b}_{0}{b}_{3} \\ \sum\limits_{i=0}^{2}{b}_{ i}{b}_{i+1} & \sum\limits_{i=0}^{3}{b}_{ i}^{2} & \sum\limits_{i=0}^{2}{b}_{ i}{b}_{i+1} & \sum\limits _{i=0}^{1}{b}_{ i}{b}_{i+2} \\ \sum\limits_{i=0}^{1}{b}_{ i}{b}_{i+2} & \sum\limits_{i=0}^{2}{b}_{ i}{b}_{i+1} & \sum\limits_{i=0}^{3}{b}_{ i}^{2} & \sum\limits_{i=0}^{2}{b}_{ i}{b}_{i+1} \\ {b}_{0}{b}_{3} & \sum\limits_{i=0}^{1}{b}_{ i}{b}_{i+2} & \sum\limits_{i=0}^{2}{b}_{ i}{b}_{i+1} & \sum\limits_{i=0}^{3}{b}_{ i}^{2} \\ \end{array} \right ] \end{array}$$
    (2.78)
  4. (d)

    By solving the difference equation, we can obtain the correlation between x(k) and x(k − l), that is

    $$\begin{array}{rcl} x(k) = {(-{a}_{1})}^{l}x(k - l) +\sum\limits_{j=0}^{l-1}{(-{a}_{ 1})}^{j}n(k - j)& & \end{array}$$
    (2.79)

    Multiplying both sides of the above equation by x(k − l) and taking the expected value of the result, we obtain

    $$\begin{array}{rcl} E[x(k)x(k - l)] = {(-{a}_{1})}^{l}E[{x}^{2}(k - l)]& & \end{array}$$
    (2.80)

    since x(k − l) is independent of n(k − j) for j ≤ l − 1.

    For l = 0, just calculate x 2(k) and apply the expectation operation to the result. The partial result is

    $$\begin{array}{rcl} E[{x}^{2}(k)] = {a}_{ 1}^{2}E[{x}^{2}(k - 1)] + E[{n}^{2}(k)]& & \end{array}$$
    (2.81)

    therefore,

    $$\begin{array}{rcl} E[{x}^{2}(k)] = \frac{{\sigma }_{n}^{2}} {1 - {a}_{1}^{2}}& & \end{array}$$
    (2.82)

    assuming x(k) is WSS.

    The elements of R are then given by

    $$\begin{array}{rcl} r(l) = \frac{{(-{a}_{1})}^{\vert l\vert }} {1 - {a}_{1}^{2}} {\sigma }_{n}^{2}& & \end{array}$$
    (2.83)

    and the 3 ×3 autocorrelation matrix becomes

    $$\begin{array}{rcl} \bf{R}& =& \frac{{\sigma }_{n}^{2}} {1 - {a}_{1}^{2}}\left [\begin{array}{ccc} 1 & - {a}_{1} & {a}_{1}^{2} \\ - {a}_{1} & 1 & - {a}_{1} \\ {a}_{1}^{2} & - {a}_{1} & 1\end{array} \right ]\\ \end{array}$$
  5. (e)

    In this case, we are interested in calculating the autocorrelation of a complex sequence, that is

    $$\begin{array}{rcl} r(l)& =& E[x(k){x}^{{_\ast}}(k - l)] \\ & =& {a}^{2}E[{\mathrm{e}}^{-j(-{\omega }_{0}l-n(k)+n(k-l))}] \end{array}$$
    (2.84)

    By recalling the definition of expected value in (2.9), for l≠0,

    $$\begin{array}{rcl} r(l)& =& {a}^{2}{\mathrm{e}}^{j{\omega }_{0}l}{ \int \nolimits \nolimits }_{-\infty }^{\infty }{\int \nolimits \nolimits }_{-\infty }^{\infty }{\mathrm{e}}^{-j(-{n}_{0}+{n}_{1})}{p}_{ n(k),n(k-l)}({n}_{0},{n}_{1})d{n}_{0}d{n}_{1} \\ & =& {a}^{2}{\mathrm{e}}^{j{\omega }_{0}l}{ \int \nolimits \nolimits }_{-\pi }^{\pi }{ \int \nolimits \nolimits }_{-\pi }^{\pi }{\mathrm{e}}^{-j(-{n}_{0}+{n}_{1})}{p}_{ n(k)}({n}_{0}){p}_{n(k-l)}({n}_{1})d{n}_{0}d{n}_{1} \\ & =& {a}^{2}{\mathrm{e}}^{j{\omega }_{0}l}{ \int \nolimits \nolimits }_{-\pi }^{\pi }{ \int \nolimits \nolimits }_{-\pi }^{\pi }{\mathrm{e}}^{-j(-{n}_{0}+{n}_{1})} \frac{1} {2\pi } \frac{1} {2\pi }d{n}_{0}d{n}_{1} \\ & =& {a}^{2}{\mathrm{e}}^{j{\omega }_{0}l} \frac{1} {4{\pi }^{2}}{ \int \nolimits \nolimits }_{-\pi }^{\pi }{ \int \nolimits \nolimits }_{-\pi }^{\pi }{\mathrm{e}}^{-j(-{n}_{0}+{n}_{1})}d{n}_{ 0}d{n}_{1} \\ & =& {a}^{2}{\mathrm{e}}^{j{\omega }_{0}l} \frac{1} {4{\pi }^{2}}\left [{\int \nolimits \nolimits }_{-\pi }^{\pi }{\mathrm{e}}^{j{n}_{0} }d{n}_{0}\right ]\left [{\int \nolimits \nolimits }_{-\pi }^{\pi }{\mathrm{e}}^{-j{n}_{1} }d{n}_{1}\right ] \\ & =& {a}^{2}{\mathrm{e}}^{j{\omega }_{0}l} \frac{1} {4{\pi }^{2}}\left [\frac{{\mathrm{e}}^{j\pi } -{\mathrm{e}}^{-j\pi }} {j} \right ]\left [\frac{-{\mathrm{e}}^{-j\pi } +{ \mathrm{e}}^{j\pi }} {j} \right ] \\ & =& -{a}^{2}{\mathrm{e}}^{j{\omega }_{0}l} \frac{1} {{\pi }^{2}}(\sin \pi )(\sin \pi ) = 0 \end{array}$$
    (2.85)

    where in the fifth equality we used the fact that n(k) and n(k − l), for l≠0, are independent.

    For l = 0

    $$\begin{array}{rcl} r(0)& =& E[x(k){x}^{{_\ast}}(k)] = {a}^{2}{\mathrm{e}}^{j({\omega }_{0}0)} = {a}^{2} \\ \end{array}$$

    Therefore,

    $$\begin{array}{rcl} r(l)& =& E[x(k){x}^{{_\ast}}(k - l)] = {a}^{2}{\mathrm{e}}^{j({\omega }_{0}l)}\delta (l) \\ \end{array}$$

    where in the 3 ×3 case

    $$\begin{array}{rcl} \bf{R}& =& \left [\begin{array}{ccc} {a}^{2} & 0 & 0 \\ 0 &{a}^{2} & 0 \\ 0 & 0 &{a}^{2}\end{array} \right ]\\ \end{array}$$

In the end, we verified that two exponential functions having uniformly distributed white noise in the range −kπ to kπ as exponents, where k is a positive integer, are orthogonal for l≠0 and nonorthogonal only when l = 0.
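As a numerical check of case (d), the sketch below generates a long AR(1) realization and compares time-average estimates of the autocorrelation (ergodicity is assumed) with the closed-form entries of (2.83); the value of a₁ and the data length are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import lfilter

rng = np.random.default_rng(5)
a1, sigma_n, K = 0.5, 1.0, 200_000               # illustrative a1, noise power, data length

n = sigma_n * rng.standard_normal(K)
x = lfilter([1.0], [1.0, a1], n)                 # case (d): x(k) = -a1 x(k-1) + n(k)

# Closed-form entries from (2.83): r(l) = (-a1)^|l| sigma_n^2 / (1 - a1^2)
r_theory = (-a1) ** np.arange(3) * sigma_n**2 / (1 - a1**2)

# Time-average estimates of the same lags
r_hat = np.array([np.mean(x[: K - l] * x[l:]) for l in range(3)])

print(toeplitz(r_theory))    # theoretical 3x3 autocorrelation matrix
print(toeplitz(r_hat))       # close to the matrix above for large K
```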

In the remaining part of this chapter and in the following chapters, we will treat the algorithms for real and complex signals separately. The derivations of the adaptive-filtering algorithms for complex signals are usually straightforward extensions of the real signal cases, and some of them are left as exercises.

2.4 Wiener Filter

One of the most widely used objective functions in adaptive filtering is the MSE, defined as

$$F[e(k)] = \xi (k) = E[{e}^{2}(k)] = E[{d}^{2}(k) - 2d(k)y(k) + {y}^{2}(k)]$$
(2.86)

where d(k) is the reference signal as illustrated in Fig. 1.1.

Suppose the adaptive filter consists of a linear combiner, i.e., the output signal is composed by a linear combination of signals coming from an array as depicted in Fig. 2.1a. In this case,

$$y(k) = \sum\limits_{i=0}^{N}{w}_{ i}(k){x}_{i}(k) ={ \bf{w}}^{T}(k)\bf{x}(k)$$
(2.87)

where x(k) = [x 0(k) x 1(k)…x N (k)]T and w(k) = [w 0(k) w 1(k)…w N (k)]T are the input signal and the adaptive-filter coefficient vectors, respectively.

In many applications, each element of the input signal vector consists of a delayed version of the same signal, that is: \({x}_{0}(k) = x(k),{x}_{1}(k) = x(k - 1),\ldots,{x}_{N}(k) = x(k - N)\). Note that in this case the signal y(k) is the result of applying an FIR filter to the input signal x(k).

Since most of the analyses and algorithms presented in this book apply equally to the linear combiner and the FIR filter cases, we will mostly consider the latter case throughout the rest of the book. The main reason for this decision is that the fast algorithms for the recursive least-squares solution, to be discussed in the forthcoming chapters, explore the fact that the input signal vector consists of the output of a delay line with a single input signal, and, as a consequence, are not applicable to the linear combiner case.

The most straightforward realization for the adaptive filter is through the direct-form FIR structure as illustrated in Fig. 2.1b, with the output given by

$$y(k) = \sum\limits_{i=0}^{N}{w}_{ i}(k)x(k - i) ={ \bf{w}}^{T}(k)\bf{x}(k)$$
(2.88)

where \(\bf{x}(k) = {[x(k)\:x(k - 1)\ldots x(k - N)]}^{T}\) is the input vector representing a tapped-delay line, and w(k) = [w 0(k) w 1(k)…w N (k)]T is the tap-weight vector.

In both the linear combiner and FIR filter cases, the objective function can be rewritten as

$$\begin{array}{rcl} E[{e}^{2}(k)]& =& \xi (k) \\ & =& E\left [{d}^{2}(k) - 2d(k){\bf{w}}^{T}(k)\bf{x}(k) +{ \bf{w}}^{T}(k)\bf{x}(k){\bf{x}}^{T}(k)\bf{w}(k)\right ] \\ & =& E[{d}^{2}(k)] - 2E[d(k){\bf{w}}^{T}(k)\bf{x}(k)] + E[{\bf{w}}^{T}(k)\bf{x}(k){\bf{x}}^{T}(k)\bf{w}(k)]\qquad \quad \end{array}$$
(2.89)

For a filter with fixed coefficients, the MSE function in a stationary environment is given by

$$\begin{array}{rcl} \xi & =& E[{d}^{2}(k)] - 2{\bf{w}}^{T}E[d(k)\bf{x}(k)] +{ \bf{w}}^{T}E[\bf{x}(k){\bf{x}}^{T}(k)]\bf{w} \\ & =& E[{d}^{2}(k)] - 2{\bf{w}}^{T}\bf{p} +{ \bf{w}}^{T}\bf{R}\bf{w} \end{array}$$
(2.90)

where p = E[d(k)x(k)] is the cross-correlation vector between the desired and input signals, and R = E[x(k)x T(k)] is the input signal correlation matrix. As can be noted, the objective function ξ is a quadratic function of the tap-weight coefficients, which would allow a straightforward solution for w that minimizes ξ if vector p and matrix R are known. Note that matrix R corresponds to the Hessian matrix of the objective function defined in the previous chapter.

If the adaptive filter is implemented through an IIR filter, the objective function is a nonquadratic function of the filter parameters, turning the minimization problem into a much more difficult one. Local minima are likely to exist, rendering some solutions obtained by gradient-based algorithms unacceptable. Despite their disadvantages, adaptive IIR filters are needed in a number of applications where the order of a suitable FIR filter is too high. Typical applications include data equalization in communication channels and cancellation of acoustic echo, see Chap. 10.

The gradient vector of the MSE function related to the filter tap-weight coefficients is given by

$$\begin{array}{rcl}{ \mathbf{g}}_{\bf{w}}& =& \frac{\partial \xi } {\partial \bf{w}} ={ \left [ \frac{\partial \xi } {\partial {w}_{0}}\: \frac{\partial \xi } {\partial {w}_{1}}\ldots \frac{\partial \xi } {\partial {w}_{N}}\right ]}^{T} \\ & =& -2\bf{p} + 2\bf{R}\bf{w} \end{array}$$
(2.91)

By equating the gradient vector to zero and assuming R is nonsingular, the optimal values for the tap-weight coefficients that minimize the objective function can be evaluated as follows:

$${ \bf{w}}_{o} ={ \bf{R}}^{-1}\bf{p}$$
(2.92)

This solution is called the Wiener solution. Unfortunately, in practice, precise estimates of R and p are not available. When the input and the desired signals are ergodic, one is able to use time averages to estimate R and p, which is implicitly performed by most adaptive algorithms.

If we replace the optimal solution for w in the MSE expression, we can calculate the minimum MSE provided by the Wiener solution:

$$\begin{array}{rcl}{ \xi }_{\mathrm{min}}& =& E[{d}^{2}(k)] - 2{\bf{w}}_{ o}^{T}\bf{p} +{ \bf{w}}_{ o}^{T}\bf{R}{\bf{R}}^{-1}\bf{p} \\ & =& E[{d}^{2}(k)] -{\bf{w}}_{ o}^{T}\bf{p} \end{array}$$
(2.93)

The above equation indicates that the optimal set of parameters removes part of the power of the desired signal through the cross-correlation between x(k) and d(k), assuming both signals are stationary. If the reference signal and the input signal are orthogonal, the optimal coefficients are equal to zero and the minimum MSE is E[d 2(k)]. This result is expected since nothing can be done with the parameters in order to minimize the MSE if the input signal carries no information about the desired signal. In this case, if any of the taps is nonzero, it would only increase the MSE.

An important property of the Wiener filter can be deduced if we analyze the gradient of the error surface at the optimal solution. The gradient vector can be expressed as follows:

$${ \mathbf{g}}_{\bf{w}} = \frac{\partial E[{e}^{2}(k)]} {\partial \bf{w}} = E[2e(k)\frac{\partial e(k)} {\partial \bf{w}} ] = -E[2e(k)\bf{x}(k)]$$
(2.94)

With the coefficients set at their optimal values, i.e., at the Wiener solution, the gradient vector is equal to zero, implying that

$$E[e(k)\bf{x}(k)] = \bf{0}$$
(2.95)

or

$$E[e(k)x(k - i)] = 0$$
(2.96)

for i = 0, 1, …, N. This means that the error signal is orthogonal to the elements of the input signal vector. In case either the error or the input signal has zero mean, the orthogonality property implies that e(k) and x(k) are uncorrelated.

The orthogonality principle also applies to the correlation between the output signal y(k) and the error e(k), when the tap weights are given by w = w o . By premultiplying (2.95) by w o T, the desired result follows:

$$E[e(k){\bf{w}}_{o}^{T}\bf{x}(k)] = E[e(k)y(k)] = 0$$
(2.97)
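A quick numerical check of the orthogonality principle is sketched below (again an added illustration with an arbitrary signal model of our own); time averages stand in for the expectations in (2.95)–(2.97).

```python
import numpy as np

rng = np.random.default_rng(5)

# Arbitrary signal model (ours): 2-tap system plus noise, 2-tap Wiener filter
K = 100000
x = rng.standard_normal(K)
X = np.column_stack([x, np.concatenate(([0.0], x[:-1]))])
d = X @ np.array([0.7, 0.2]) + 0.1 * rng.standard_normal(K)

# Wiener solution from time-averaged statistics, then the resulting error signal
w_o = np.linalg.solve(X.T @ X / K, X.T @ d / K)
e = d - X @ w_o

print(X.T @ e / K)                # ~ [0, 0]: E[e(k) x(k - i)] = 0, cf. (2.96)
print(np.mean(e * (X @ w_o)))     # ~ 0:      E[e(k) y(k)] = 0,     cf. (2.97)
```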

The gradient with respect to a complex parameter has not been defined so far. For our purposes the complex gradient vector can be defined as [18]

$$\begin{array}{rcl}{ \mathbf{g}}_{\bf{w}(k)}\{F(e(k))\} = \frac{1} {2}\left \{ \frac{\partial F[e(k)]} {\partial \mathrm{re}[\bf{w}(k)]} - j \frac{\partial F[e(k)]} {\partial \mathrm{im}[\bf{w}(k)]}\right \}& & \\ \end{array}$$

where re[ ⋅] and im[ ⋅] indicate real and imaginary parts of [ ⋅], respectively. Note that the partial derivatives are calculated for each element of w(k).

For the complex case the error signal and the MSE are, respectively, described by, see Chap. 14 for details,

$$\begin{array}{rcl} e(k)& =& d(k) -{\bf{w}}^{H}(k)\bf{x}(k)\end{array}$$
(2.98)

and

$$\begin{array}{rcl} \xi & =& E[\vert e(k){\vert }^{2}] \\ & =& E[\vert d(k){\vert }^{2}] - 2\mathrm{re}\{{\bf{w}}^{H}E[{d}^{{_\ast}}(k)\bf{x}(k)]\} +{ \bf{w}}^{H}E[\bf{x}(k){\bf{x}}^{H}(k)]\bf{w} \\ & =& E[\vert d(k){\vert }^{2}] - 2\mathrm{re}[{\bf{w}}^{H}\bf{p}] +{ \bf{w}}^{H}\bf{R}\bf{w} \end{array}$$
(2.99)

where p = E[d  ∗ (k)x(k)] is the cross-correlation vector between the desired and input signals, and R = E[x(k)x H(k)] is the input signal correlation matrix. The Wiener solution in this case is also given by (2.92).

Example 2.2.

The input signal of a first-order adaptive filter is described by

$$\begin{array}{rcl} x(k) = {\alpha }_{1}{x}_{1}(k) + {\alpha }_{2}{x}_{2}(k)& & \\ \end{array}$$

where x 1(k) and x 2(k) are mutually uncorrelated first-order AR processes, each with unit variance. These signals are generated by applying distinct white noises to first-order filters whose poles are placed at − s 1 and − s 2, respectively.

  1. (a)

    Calculate the autocorrelation matrix of the input signal.

  2. (b)

    If the desired signal consists of x 2(k), calculate the Wiener solution.

Solution.

  1. (a)

    The models for the signals involved are described by

    $${x}_{i}(k) = -{s}_{i}{x}_{i}(k - 1) + {\kappa }_{i}{n}_{i}(k)$$

    for i = 1, 2. According to (2.83) the autocorrelation of either x i (k) is given by

    $$\begin{array}{rcl} E[{x}_{i}(k){x}_{i}(k - l)] = {\kappa }_{i}^{2}\frac{{(-{s}_{i})}^{\vert l\vert }} {1 - {s}_{i}^{2}} {\sigma }_{n,i}^{2}& & \end{array}$$
    (2.100)

    where σ n, i 2 is the variance of n i (k). Since each signal x i (k) has unit variance, applying l = 0 to the above equation yields

    $$\begin{array}{rcl}{ \kappa }_{i}^{2} = \frac{1 - {s}_{i}^{2}} {{\sigma }_{n,i}^{2}} & & \end{array}$$
    (2.101)

    Now, by utilizing the fact that x 1(k) and x 2(k) are uncorrelated, the autocorrelation matrix of the input signal and the cross-correlation vector needed in part (b) are

    $$\begin{array}{rcl} \bf{R}& =& \left [\begin{array}{cc} {\alpha }_{1}^{2} + {\alpha }_{2}^{2} & - {\alpha }_{1}^{2}{s}_{1} - {\alpha }_{2}^{2}{s}_{2} \\ - {\alpha }_{1}^{2}{s}_{1} - {\alpha }_{2}^{2}{s}_{2} & {\alpha }_{1}^{2} + {\alpha }_{2}^{2}\end{array} \right ] \\ \bf{p}& =& \left [\begin{array}{c} {\alpha }_{2} \\ - {\alpha }_{2}{s}_{2}\end{array} \right ]\\ \end{array}$$
  2. (b)

    The Wiener solution can then be expressed as

    $$\begin{array}{rcl}{ \bf{w}}_{o}& =&{ \bf{R}}^{-1}\bf{p} \\ & =& \frac{1} {{({\alpha }_{1}^{2} + {\alpha }_{2}^{2})}^{2} - {({\alpha }_{1}^{2}{s}_{1} + {\alpha }_{2}^{2}{s}_{2})}^{2}}\left [\begin{array}{cc} {\alpha }_{1}^{2} + {\alpha }_{2}^{2} & {\alpha }_{1}^{2}{s}_{1} + {\alpha }_{2}^{2}{s}_{2} \\ {\alpha }_{1}^{2}{s}_{1} + {\alpha }_{2}^{2}{s}_{2} & {\alpha }_{1}^{2} + {\alpha }_{2}^{2}\end{array} \right ]\left [\begin{array}{c} {\alpha }_{2} \\ - {\alpha }_{2}{s}_{2}\end{array} \right ] \\ & =& \frac{1} {{(1 + \frac{{\alpha }_{2}^{2}} {{\alpha }_{1}^{2}} )}^{2} - {({s}_{1} + \frac{{\alpha }_{2}^{2}} {{\alpha }_{1}^{2}} {s}_{2})}^{2}}\left [\begin{array}{cc} 1 + \frac{{\alpha }_{2}^{2}} {{\alpha }_{1}^{2}} & {s}_{1} + \frac{{\alpha }_{2}^{2}} {{\alpha }_{1}^{2}} {s}_{2} \\ {s}_{1} + \frac{{\alpha }_{2}^{2}} {{\alpha }_{1}^{2}} {s}_{2} & 1 + \frac{{\alpha }_{2}^{2}} {{\alpha }_{1}^{2}}\end{array} \right ]\left [\begin{array}{c} \frac{{\alpha }_{2}} {{\alpha }_{1}^{2}} \\ - \frac{{\alpha }_{2}} {{\alpha }_{1}^{2}} {s}_{2}\end{array} \right ] \\ & =& {\alpha }_{2}\left [\begin{array}{cc} \frac{1} {{\alpha }_{1}^{2}+{\alpha }_{2}^{2}-{s}_{1}{\alpha }_{1}^{2}-{s}_{2}{\alpha }_{2}^{2}} & 0 \\ 0 & \frac{1} {{\alpha }_{1}^{2}+{\alpha }_{2}^{2}+{s}_{1}{\alpha }_{1}^{2}+{s}_{2}{\alpha }_{2}^{2}}\end{array} \right ]\left [\begin{array}{c} \frac{1-{s}_{2}} {2} \\ -\frac{1+{s}_{2}} {2}\end{array} \right ] \\ \end{array}$$

    Let’s assume that in this example our task was to detect the presence of x 2(k) in the input signal. For a fixed input-signal power, this solution shows that a lower signal-to-interference ratio at the input, that is, a lower \(\frac{{\alpha }_{2}^{2}} {{\alpha }_{1}^{2}}\), leads to a Wiener solution vector with lower norm. This result reflects the fact that the Wiener solution tries to detect the desired signal while avoiding enhancement of the undesired signal, i.e., the interference x 1(k).
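For readers who wish to verify this observation numerically, the sketch below evaluates w o = R  − 1 p from the R and p derived above while keeping the input power α 1 2 + α 2 2 fixed; the pole positions and power ratios are illustrative values of ours, not part of the example.

```python
import numpy as np

def wiener_example_2_2(a1, a2, s1, s2):
    """Wiener solution w_o = R^{-1} p using the R and p derived in Example 2.2."""
    R = np.array([[a1**2 + a2**2,        -a1**2*s1 - a2**2*s2],
                  [-a1**2*s1 - a2**2*s2,  a1**2 + a2**2      ]])
    p = np.array([a2, -a2*s2])
    return np.linalg.solve(R, p)

# Illustrative poles (our choice) and a fixed input power alpha1^2 + alpha2^2 = 1
s1, s2 = 0.6, -0.3
for ratio in (9.0, 1.0, 1.0/9.0):        # ratio = alpha2^2 / alpha1^2
    a1 = np.sqrt(1.0 / (1.0 + ratio))
    a2 = np.sqrt(ratio / (1.0 + ratio))
    w_o = wiener_example_2_2(a1, a2, s1, s2)
    print(f"alpha2^2/alpha1^2 = {ratio:6.3f}  ->  ||w_o|| = {np.linalg.norm(w_o):.3f}")
# The norm of w_o decreases as the signal-to-interference ratio decreases,
# in agreement with the discussion above.
```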

2.5 Linearly Constrained Wiener Filter

In a number of applications, it is required to impose some linear constraints on the filter coefficients such that the optimal solution is the one that achieves the minimum MSE, provided the constraints are met. Typical constraints are: unity norm of the parameter vector; linear phase of the adaptive filter; prescribed gains at given frequencies.

In the particular case of an array of antennas the measured signals can be linearly combined to form a directional beam, where the signal impinging on the array in the desired direction will have higher gain. This application is called beamforming, where we specify gains at certain directions of arrival. It is clear that the array is introducing another dimension to the received data, namely spatial information. The weights in the antennas can be made adaptive leading to the so-called adaptive antenna arrays. This is the principle behind the concept of smart antennas, where a set of adaptive array processors filter the signals coming from the array, and direct the beam to several different directions where a potential communication is required. For example, in a wireless communication system we are able to form a beam for each subscriber according to its position, ultimately leading to minimization of noise from the environment and interference from other subscribers.

In order to develop the theory of linearly constrained optimal filters, let us consider the particular application of a narrowband beamformer required to pass without distortion all signals arriving at 90 ∘  with respect to the array of antennas. All other sources of signals shall be treated as interferers and must be attenuated as much as possible. Figure 2.2 illustrates the application. Note that in case the signal of interest does not impinge the array at 90 ∘  with respect to the array, a steering operation in the constraint vector c (to be defined) has to be performed [23].

Fig. 2.2
figure 2

Narrowband beamformer

The optimal filter that satisfies the linear constraints is called the linearly constrained minimum-variance (LCMV) filter.

If the desired signal source is sufficiently far from the array of antennas, then we may assume that the wavefronts are planar at the array. Therefore, the wavefront from the desired source will reach all antennas at the same instant, whereas the wavefront from the interferer will reach each antenna at different time instants. Taking the antenna with input signal x 0 as a time reference t 0, the wavefront will reach the ith antenna at [23]

$$ \begin{array}{rcl}{ t}_{i} = {t}_{0} + i\frac{d\cos \theta } {c} & & \\ \end{array} $$

where θ is the angle between the antenna array and the interferer direction of arrival, d is the distance between neighboring antennas, and c is the speed of propagation of the wave (\(3 \times 1{0}^{8}\) m/s).

For this particular case, the LCMV filter is the one that minimizes the array output signal energy

$$\begin{array}{rcl} & & \xi = E[{y}^{2}(k)] = E[{\bf{w}}^{T}\bf{x}(k){\bf{x}}^{T}(k)\bf{w}]\:\: \\ & & \mathrm{subject\:to :}\ \ \ \ \ \sum\limits_{j=0}^{N}{c}_{ j}{w}_{j} = f \end{array}$$
(2.102)

where

$$\begin{array}{rcl} \bf{w}& =& {[{w}_{0}\:{w}_{1}\ldots {w}_{N}]}^{T} \\ \bf{x}(k)& =& {[{x}_{0}(k)\:{x}_{1}(k)\ldots {x}_{N}(k)]}^{T} \\ \end{array}$$

and

$$\begin{array}{rcl} \bf{c} = {[1\:1\ldots 1]}^{T}& & \\ \end{array}$$

is the constraint vector, since θ = 90 ∘ . The desired gain is usually f = 1.

In the case the desired signal impinges the array at an angle θ with respect to the array, the incoming signal reaches the ith antenna delayed by \(i\frac{d\cos \theta } {c}\) with respect to the 0th antenna [24]. Let’s consider the case of a narrowband array such that all antennas detect the impinging signal with the same amplitude when measured taking into consideration their relative delays, which are multiples of \(\frac{d\cos \theta } {c}\). In such a case the optimal receiver coefficients would be

$$\begin{array}{rcl}{ w}_{i} = \frac{{\mathrm{e}}^{j\omega {\tau }_{i}}} {N + 1}& &\end{array}$$
(2.103)

for i = 0, 1, …, N, in order to add coherently the delayed replicas of the desired incoming signal arriving from a given direction θ. The impinging signal appears at the ith antenna multiplied by \({\mathrm{e}}^{-j\omega {\tau }_{i}}\), considering the particular array configuration of Fig. 2.2. In this uniform linear array, the antenna locations are

$$\begin{array}{rcl}{ p}_{i} = id& & \\ \end{array}$$

for i = 0, 1, …, N. Using the 0th antenna as reference, the signal will reach the array according to the following pattern

$$\begin{array}{rcl} \tilde{\bf{c}}& =&{ \mathrm{e}}^{j\omega t}{\left [1\:{\mathrm{e}}^{-j\omega \frac{d\cos \theta } {c} }\:{\mathrm{e}}^{-j\omega \frac{2d\cos \theta } {c} }\ldots {\mathrm{e}}^{-j\omega \frac{Nd\cos \theta } {c} }\right ]}^{T} \\ & =&{ \mathrm{e}}^{j\omega t}{\left [1\:{\mathrm{e}}^{-j\frac{2\pi } {\lambda } d\cos \theta }\:{\mathrm{e}}^{-j\frac{2\pi } {\lambda } 2d\cos \theta }\ldots {\mathrm{e}}^{-j\frac{2\pi } {\lambda } Nd\cos \theta }\right ]}^{T}\end{array}$$
(2.104)

where the equality \(\frac{\omega } {c} = \frac{2\pi } {\lambda }\) was employed, with λ being the wavelength corresponding to the frequency ω.

By defining the variable \(\psi (\omega,\theta ) = \frac{2\pi } {\lambda } d\cos \theta \), we can describe the output signal of the beamformer as

$$\begin{array}{rcl} y& =&{ \mathrm{e}}^{j\omega t} \sum\limits_{i=0}^{N}{w}_{ i}{\mathrm{e}}^{-j\psi (\omega,\theta )i} \\ & =&{ \mathrm{e}}^{j\omega t}H(\omega,\theta ) \end{array}$$
(2.105)

where H(ω, θ) modifies the amplitude and phase of the transmitted signal at a given frequency ω. Note that the shaping function H(ω, θ) depends on the impinging angle.

For the sake of illustration, if the antenna separation is \(d = \frac{\lambda } {2}\), θ = 60 ∘ , and N is odd, then the constraint vector would be

$$\begin{array}{rcl} \bf{c}& =&{ \left [1\:\:\:{\mathrm{e}}^{-j\frac{\pi } {2} }\:\:\:{\mathrm{e}}^{-j\pi }\ldots {\mathrm{e}}^{-j\frac{N\pi } {2} }\right ]}^{T} \\ & =&{ \left [1\:\:\: - j\:\:\: - 1\ldots {\mathrm{e}}^{-j\frac{N\pi } {2} }\right ]}^{T}\end{array}$$
(2.106)
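The constraint vector and the shaping function are easy to generate numerically. The sketch below (our own illustration, with an arbitrary array size) builds the steering vector of (2.104) for half-wavelength spacing and evaluates |H(ω, θ)| of (2.105) for the delay-and-sum coefficients of (2.103).

```python
import numpy as np

def steering_vector(N, d, wavelength, theta):
    # Array response of (2.104) for an (N+1)-element uniform linear array,
    # dropping the common factor e^{j omega t}
    psi = 2 * np.pi / wavelength * d * np.cos(theta)   # psi(omega, theta)
    return np.exp(-1j * psi * np.arange(N + 1))

N = 3                          # N + 1 = 4 antennas (illustrative)
wavelength = 1.0
d = wavelength / 2             # half-wavelength spacing, as in (2.106)

# Constraint vector for a desired direction of 60 degrees, cf. (2.106)
c = steering_vector(N, d, wavelength, np.deg2rad(60.0))
print(np.round(c, 4))          # pattern 1, -j, -1, j

# Delay-and-sum coefficients w_i = e^{j omega tau_i}/(N+1), cf. (2.103),
# and the resulting shaping function H of (2.105) at a few arrival angles
w = np.conj(c) / (N + 1)
for deg in (0, 30, 60, 90, 120):
    a = steering_vector(N, d, wavelength, np.deg2rad(float(deg)))
    print(f"|H| at {deg:3d} degrees: {abs(np.sum(w * a)):.3f}")   # unit gain at 60 degrees
```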

Using the method of Lagrange multipliers, we can rewrite the constrained minimization problem described in (2.102) as

$$\begin{array}{rcl}{ \xi }_{\mathrm{c}} = E[{\bf{w}}^{T}\bf{x}(k){\bf{x}}^{T}(k)\bf{w}] + \lambda ({\bf{c}}^{T}\bf{w} - f)& &\end{array}$$
(2.107)

The gradient of ξc with respect to w is equal to

$$\begin{array}{rcl}{ \mathbf{g}}_{\bf{w}} = 2\bf{R}\bf{w} + \lambda \bf{c}& &\end{array}$$
(2.108)

where R = E[x(k)x T(k)]. For a positive definite matrix R, the value of w that satisfies g w  = 0 is unique and minimizes ξ c . Denoting w o as the optimal solution, we have

$$\begin{array}{rcl} & & 2\bf{R}{\bf{w}}_{o} + \lambda \bf{c} = \bf{0} \\ & & 2{\bf{c}}^{T}{\bf{w}}_{ o} + \lambda {\bf{c}}^{T}{\bf{R}}^{-1}\bf{c} = 0 \\ & & 2f + \lambda {\bf{c}}^{T}{\bf{R}}^{-1}\bf{c} = 0 \\ \end{array}$$

where in order to obtain the second equality, we premultiply the first equation by c T R  − 1. Therefore,

$$\begin{array}{rcl} \lambda = -2{({\bf{c}}^{T}{\bf{R}}^{-1}\bf{c})}^{-1}f& & \\ \end{array}$$

and the LCMV filter is

$$\begin{array}{rcl}{ \bf{w}}_{o} ={ \bf{R}}^{-1}\bf{c}{({\bf{c}}^{T}{\bf{R}}^{-1}\bf{c})}^{-1}f& &\end{array}$$
(2.109)

If more constraints need to be satisfied by the filter, these can be easily incorporated in a constraint matrix and in a gain vector, such that

$$\begin{array}{rcl}{ \bf{C}}^{T}\bf{w} = \bf{f}& &\end{array}$$
(2.110)

In this case, the LCMV filter is given by

$$\begin{array}{rcl}{ \bf{w}}_{o} ={ \bf{R}}^{-1}\bf{C}{({\bf{C}}^{T}{\bf{R}}^{-1}\bf{C})}^{-1}\bf{f}& &\end{array}$$
(2.111)
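A minimal numerical sketch of (2.110) and (2.111) follows; the correlation matrix, constraint matrix, and gain vector are arbitrary illustrative choices of ours, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary positive definite correlation matrix R (illustrative only)
A = rng.standard_normal((4, 4))
R = A @ A.T + 4 * np.eye(4)

# Two linear constraints C^T w = f, cf. (2.110)
C = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [1.0,  1.0],
              [1.0, -1.0]])
f = np.array([1.0, 0.0])

# LCMV filter of (2.111)
Rinv_C = np.linalg.solve(R, C)
w_o = Rinv_C @ np.linalg.solve(C.T @ Rinv_C, f)

print(C.T @ w_o)          # reproduces f: the constraints are met exactly
print(w_o @ R @ w_o)      # minimized output energy E[y^2(k)]
```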

If there is a desired signal, the natural objective is the minimization of the MSE, not the output energy as in the narrowband beamformer. In this case, it is straightforward to modify (2.107) and obtain the optimal solution

$$\begin{array}{rcl}{ \bf{w}}_{o} ={ \bf{R}}^{-1}\bf{p} +{ \bf{R}}^{-1}\bf{C}{({\bf{C}}^{T}{\bf{R}}^{-1}\bf{C})}^{-1}(\bf{f} -{\bf{C}}^{T}{\bf{R}}^{-1}\bf{p})& &\end{array}$$
(2.112)

where p = E[d(k)x(k)], see Problem 20.

In the case of complex input signals and constraints, the optimal solution is given by

$$\begin{array}{rcl}{ \bf{w}}_{o} ={ \bf{R}}^{-1}\bf{p} +{ \bf{R}}^{-1}\bf{C}{({\bf{C}}^{H}{\bf{R}}^{-1}\bf{C})}^{-1}(\bf{f} -{\bf{C}}^{H}{\bf{R}}^{-1}\bf{p})& &\end{array}$$
(2.113)

where C H w = f.

2.5.1 The Generalized Sidelobe Canceller

An alternative implementation to the direct-form constrained adaptive filter shown above is called the generalized sidelobe canceller (GSC) (see Fig. 2.3) [25].

Fig. 2.3
figure 3

The generalized sidelobe canceller

For this structure the input signal vector is transformed by a matrix

$$\begin{array}{rcl} \bf{T} = [\bf{C}\ \bf{B}]& &\end{array}$$
(2.114)

where C is the constraint matrix and B is a blocking matrix whose columns span the null space of C T, i.e., matrix B satisfies

$$\begin{array}{rcl}{ \bf{B}}^{T}\bf{C} = \bf{0}& &\end{array}$$
(2.115)

The output signal y(k) shown in Fig. 2.3 is formed as

$$\begin{array}{rcl} y(k)& =&{ \bf{w}}_{u}^{T}{\bf{C}}^{T}\bf{x}(k) +{ \bf{w}}_{ l}^{T}{\bf{B}}^{T}\bf{x}(k) \\ & =& {(\bf{C}{\bf{w}}_{u} + \bf{B}{\bf{w}}_{l})}^{T}\bf{x}(k) \\ & =& {(\bf{T}\bf{w})}^{T}\bf{x}(k) \\ & =&{ \bar{\bf{w}}}^{T}\bf{x}(k) \end{array}$$
(2.116)

where w = [w u T w l T]T and \(\bar{\bf{w}} = \bf{T}\bf{w}\).

The linear constraints are satisfied if \({\bf{C}}^{T}\bar{\bf{w}} = \bf{f}\). But as C T B = 0, then the condition to be satisfied becomes

$$\begin{array}{rcl}{ \bf{C}}^{T}\bar{\bf{w}} ={ \bf{C}}^{T}\bf{C}{\bf{w}}_{ u} = \bf{f}& &\end{array}$$
(2.117)

Therefore, for the GSC structure shown in Fig. 2.3 there is a necessary condition that the upper part of the coefficient vector, w u , should be initialized as

$$\begin{array}{rcl}{ \bf{w}}_{u} = {({\bf{C}}^{T}\bf{C})}^{-1}\bf{f}& &\end{array}$$
(2.118)

Minimization of the output energy is achieved with a proper choice of w l . In fact, we transformed a constrained optimization problem into an unconstrained one, which in turn can be solved with the classical linear Wiener filter, i.e.,

$$ \begin{array}{rcl} \min\limits_{{\bf{w}}_{l}}E[{y}^{2}(k)]& =& \min\limits_{{\bf{w}}_{l}}E\{{[{y}_{u}(k) +{ \bf{w}}_{l}^{T}{\bf{x}}_{ l}(k)]}^{2}\} \\ \Rightarrow \ \ {\bf{w}}_{l,o}& =& -{\bf{R}}_{l}^{-1}{\bf{p}}_{ l}, \end{array}$$
(2.119)

where

$$\begin{array}{rcl}{ \bf{R}}_{l}& =& E[{\bf{x}}_{l}(k){\bf{x}}_{l}^{T}(k)] \\ & =& E[{\bf{B}}^{T}\bf{x}(k){\bf{x}}^{T}(k)\bf{B}] \\ & =&{ \bf{B}}^{T}E[\bf{x}(k){\bf{x}}^{T}(k)]\bf{B} \\ & =&{ \bf{B}}^{T}\bf{R}\bf{B} \end{array}$$
(2.120)

and

$$\begin{array}{rcl}{ \bf{p}}_{l}& =& E[{y}_{u}(k)\ {\bf{x}}_{l}(k)] = E[{\bf{x}}_{l}(k)\ {y}_{u}(k)] \\ & =& E[{\bf{B}}^{T}\bf{x}(k)\ {\bf{w}}_{ u}^{T}{\bf{C}}^{T}\bf{x}(k)] \\ & =& E[{\bf{B}}^{T}\bf{x}(k)\ {\bf{x}}^{T}(k)\bf{C}{\bf{w}}_{ u}] \\ & =&{ \bf{B}}^{T}E[\bf{x}(k)\ {\bf{x}}^{T}(k)]\bf{C}{\bf{w}}_{ u} \\ & =&{ \bf{B}}^{T}\bf{R}\bf{C}{\bf{w}}_{ u} \\ & =&{ \bf{B}}^{T}\bf{R}\bf{C}{({\bf{C}}^{T}\bf{C})}^{-1}\bf{f} \end{array}$$
(2.121)

where in the above derivations we utilized the results and definitions from (2.116) and (2.118).

Using (2.118), (2.120), and (2.121) it is possible to show that

$$\begin{array}{rcl}{ \bf{w}}_{l,o} = -{({\bf{B}}^{T}\bf{R}\bf{B})}^{-1}{\bf{B}}^{T}\bf{R}\bf{C}{({\bf{C}}^{T}\bf{C})}^{-1}\bf{f}& &\end{array}$$
(2.122)

Given that w l, o is the solution to an unconstrained minimization problem of transformed quantities, any unconstrained adaptive filter can be used to recursively estimate this optimal solution. The drawback in the implementation of the GSC structure comes from the transformation of the input signal vector via a constraint matrix and a blocking matrix. Although in theory any matrix with linearly independent columns spanning the null space of C T can be employed, in many cases the computational complexity resulting from the multiplication of B by x(k) can be prohibitive. Furthermore, if the transformation matrix T is not orthogonal, finite-precision effects may yield an overall unstable system. A simple solution that guarantees orthogonality in the transformation and low computational complexity can be obtained with a Householder transformation [26].
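The sketch below (our own, with the same kind of arbitrary illustrative data as before) builds an orthonormal blocking matrix satisfying (2.115) via an SVD, evaluates (2.118) and (2.122), and verifies that the overall GSC coefficient vector coincides with the direct-form LCMV solution (2.111).

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: positive definite R and two constraints C^T w = f
A = rng.standard_normal((4, 4))
R = A @ A.T + 4 * np.eye(4)
C = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [1.0,  1.0],
              [1.0, -1.0]])
f = np.array([1.0, 0.0])

# Orthonormal blocking matrix: columns orthogonal to those of C, so B^T C = 0 (2.115)
U, _, _ = np.linalg.svd(C, full_matrices=True)
B = U[:, C.shape[1]:]

# Upper branch (2.118) and optimal lower branch (2.122)
w_u = np.linalg.solve(C.T @ C, f)
w_lo = -np.linalg.solve(B.T @ R @ B, B.T @ R @ C @ w_u)
w_gsc = C @ w_u + B @ w_lo                       # overall coefficient vector, cf. (2.116)

# Direct-form LCMV solution (2.111) for comparison
Rinv_C = np.linalg.solve(R, C)
w_lcmv = Rinv_C @ np.linalg.solve(C.T @ Rinv_C, f)

print(np.allclose(w_gsc, w_lcmv))                # True: both forms give the same filter
```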

2.6 MSE Surface

The MSE is a quadratic function of the parameters w. Assuming a given fixed w, the MSE is not a function of time and can be expressed as

$$\begin{array}{rcl} \xi = {\sigma }_{d}^{2} - 2{\bf{w}}^{T}\bf{p} +{ \bf{w}}^{T}\bf{R}\bf{w}& &\end{array}$$
(2.123)

where σ d 2 is the variance of d(k), assuming it has zero mean. The MSE is a quadratic function of the tap weights forming a hyperparaboloid surface. The MSE surface is convex and has only positive values. For two weights, the surface is a paraboloid. Figure 2.4 illustrates the MSE surface for a numerical example where w has two coefficients. If the MSE surface is intersected by a plane parallel to the w plane, placed at a level above ξmin, the intersection consists of an ellipse representing equal MSE contours, as depicted in Fig. 2.5. Note that in this figure we show three distinct ellipses, corresponding to different levels of MSE. The ellipses of constant MSE are all concentric. In order to understand the properties of the MSE surface, it is convenient to define a translated coefficient vector as follows:

$$\Delta \bf{w} = \bf{w} -{\bf{w}}_{o}$$
(2.124)
Fig. 2.4
figure 4

Mean-square error surface

Fig. 2.5
figure 5

Contours of the MSE surface

The MSE can be expressed as a function of Δw as follows:

$$\begin{array}{rcl} \xi & =& {\sigma }_{d}^{2} -{\bf{w}}_{ o}^{T}\bf{p} +{ \bf{w}}_{ o}^{T}\bf{p} - 2{\bf{w}}^{T}\bf{p} +{ \bf{w}}^{T}\bf{R}\bf{w} \\ & =& {\xi }_{\mathrm{min}} - \Delta {\bf{w}}^{T}\bf{p} -{\bf{w}}^{T}\bf{R}{\bf{w}}_{ o} +{ \bf{w}}^{T}\bf{R}\bf{w} \\ & =& {\xi }_{\mathrm{min}} - \Delta {\bf{w}}^{T}\bf{p} +{ \bf{w}}^{T}\bf{R}\Delta \bf{w} \\ & =& {\xi }_{\mathrm{min}} -{\bf{w}}_{o}^{T}\bf{R}\Delta \bf{w} +{ \bf{w}}^{T}\bf{R}\Delta \bf{w} \\ & =& {\xi }_{\mathrm{min}} + \Delta {\bf{w}}^{T}\bf{R}\Delta \bf{w} \end{array}$$
(2.125)

where we used the results of (2.92) and (2.93). The corresponding error surface contours are depicted in Fig. 2.6.

Fig. 2.6
figure 6

Translated contours of the MSE surface

By employing the diagonalized form of R, the last equation can be rewritten as follows:

$$\begin{array}{rcl} \xi & =& {\xi }_{\mathrm{min}} + \Delta {\bf{w}}^{T}\bf{Q}{\Lambda }{\bf{Q}}^{T}\Delta \bf{w} \\ & =& {\xi }_{\mathrm{min}} +{ \mathbf{v}}^{T}{\Lambda }\mathbf{v} \\ & =& {\xi }_{\mathrm{min}} + \sum\limits_{i=0}^{N}{\lambda }_{ i}{v}_{i}^{2}\end{array}$$
(2.126)

where v = Q TΔw are the rotated parameters.

The above form for representing the MSE surface is an uncoupled form, in the sense that each component of the gradient vector of the MSE with respect to the rotated parameters is a function of a single parameter, that is

$${\mathbf{g}}_{\mathbf{v}}[\xi ] = {[2{\lambda }_{0}{v}_{0}\:\:\:2{\lambda }_{1}{v}_{1}\:\ldots \:2{\lambda }_{N}{v}_{N}]}^{T}$$

This property means that if all v i ’s are zero except one, the gradient direction coincides with the nonzero parameter axis. In other words, the rotated parameters represent the principal axes of the hyperellipses of constant MSE, as illustrated in Fig. 2.7. Note that since the rotated parameters are the result of the projection of the original parameter-error vector Δw onto the directions of the eigenvectors q i , it is straightforward to conclude that the eigenvectors represent the principal axes of the constant-MSE hyperellipses.

Fig. 2.7
figure 7

Rotated contours of the MSE surface

The matrix of second derivatives of ξ with respect to the rotated parameters is Λ. We can note that the gradient is steeper along the principal axes corresponding to larger eigenvalues. In the two-axis case, this is the direction in which the ellipse is narrow.
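As a brief numerical illustration of (2.123), (2.125), and (2.126) (the statistics below are arbitrary values of ours), the MSE evaluated directly from R, p, and σ d 2 agrees with ξmin + Σ λ i v i 2 computed in the rotated coordinates.

```python
import numpy as np

# Illustrative second-order statistics (values of ours, not from the text)
R = np.array([[1.0, 0.4],
              [0.4, 1.0]])
p = np.array([0.3, -0.1])
sigma_d2 = 0.5

w_o = np.linalg.solve(R, p)                      # Wiener solution (2.92)
xi_min = sigma_d2 - w_o @ p                      # minimum MSE (2.93)

lam, Q = np.linalg.eigh(R)                       # R = Q diag(lam) Q^T

w = np.array([0.7, -0.9])                        # arbitrary operating point
dw = w - w_o                                     # Delta w of (2.124)
v = Q.T @ dw                                     # rotated parameters

xi_direct = sigma_d2 - 2 * w @ p + w @ R @ w     # MSE from (2.123)
xi_rotated = xi_min + np.sum(lam * v**2)         # MSE from (2.126)
print(xi_direct, xi_rotated)                     # the two values coincide
```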

2.7 Bias and Consistency

The correct interpretation of the results obtained by the adaptive-filtering algorithm requires the definitions of bias and consistency. An estimate is considered unbiased if the following condition is satisfied

$$E[\mbox{w}(k)] ={ \mbox{w}}_{o}$$
(2.127)

The difference E[w(k)] − w o is called the bias in the parameter estimate.

An estimate is considered consistent if

$$\mbox{w}(k) \rightarrow {\mbox{w}}_{o}\:\:\mathrm{as}\:\:k \rightarrow \infty $$
(2.128)

Note that since w(k) is a random variable, it is necessary to define in which sense the limit is taken. Usually, the limit with probability one is employed. In the case of identification, a system is considered identifiable if the given parameter estimates are consistent. For a more formal treatment on this subject, refer to [21].

2.8 Newton Algorithm

In the context of the MSE minimization discussed in the previous section, see (2.123), the coefficient-vector updating using the Newton method is performed as follows:

$$\mbox{w}(k + 1) = \mbox{w}(k) - \mu {\mbox{R}}^{-1}{\mathbf{g}}_{\mbox{w}}(k)$$
(2.129)

whose derivation originates from (1.4). Assuming the true gradient and the matrix R are available, the coefficient-vector updating can be expressed as

$$\mbox{w}(k+1) = \mbox{w}(k)-\mu {\mbox{R}}^{-1}[-2\bf{p}+2\mbox{R}\mbox{w}(k)] = (\bf{I}-2\mu \bf{I})\mbox{w}(k)+2\mu {\mbox{w}}_{ o}$$
(2.130)

where if \(\mu = 1/2\), the Wiener solution is reached in one step.

The Wiener solution can be approached using a Newton-like search algorithm, by updating the adaptive-filter coefficients as follows:

$$\mbox{w}(k + 1) = \mbox{w}(k) - \mu \hat{{\mbox{R}}}^{-1}(k)\hat{{\mathbf{g}}}_{\mbox{w}}(k)$$
(2.131)

where \(\hat{{\bf{R}}}^{-1}(k)\) is an estimate of R  − 1 and \(\hat{{\mathbf{g}}}_{\mbox{w}}(k)\) is an estimate of g w , both at instant k. The parameter μis the convergence factor that regulates the convergence rate. Newton-based algorithms present, in general, fast convergence. However, the estimate of R  − 1 is computationally intensive and can become numerically unstable if special care is not taken. These factors made the steepest-descent-based algorithms more popular in adaptive-filtering applications.
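A minimal sketch of the idealized recursion (2.129), using the true R and the true gradient (and therefore not a practical algorithm), is given below; the statistics are the same illustrative values used later in Example 2.3.

```python
import numpy as np

# Illustrative statistics (the same values appear in Example 2.3 below)
R = np.array([[1.0, 0.4045],
              [0.4045, 1.0]])
p = np.array([0.0, 0.2939])
w_o = np.linalg.solve(R, p)          # Wiener solution (2.92)

def newton(w, mu, iters):
    """Idealized Newton recursion (2.129) using the true gradient (2.91)."""
    R_inv = np.linalg.inv(R)
    for _ in range(iters):
        g = -2 * p + 2 * R @ w       # true gradient vector
        w = w - mu * R_inv @ g       # Newton update
    return w

w0 = np.array([0.0, -2.0])
print(newton(w0, mu=0.5, iters=1))   # mu = 1/2 reaches w_o in a single step, cf. (2.130)
print(newton(w0, mu=0.1, iters=50))  # a smaller mu converges geometrically to w_o
print(w_o)
```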

2.9 Steepest-Descent Algorithm

In order to get a practical feeling of a problem that is being solved using the steepest-descent algorithm, we assume that the optimal coefficient vector, i.e., the Wiener solution, is w o , and that the reference signal is not corrupted by measurement noise.Footnote 8

The main objective of the present section is to study the rate of convergence, the stability, and the steady-state behavior of an adaptive filter whose coefficients are updated through the steepest-descent algorithm. It is worth mentioning that the steepest-descent method can be considered an efficient gradient-type algorithm, in the sense that it works with the true gradient vector, and not with an estimate of it. Therefore, the performance of other gradient-type algorithms can at most be close to the performance of the steepest-descent algorithm. When the objective function is the MSE, the difficult task of obtaining the matrix R and the vector p prevents the steepest-descent algorithm from being useful in adaptive-filtering applications. Its performance, however, serves as a benchmark for gradient-based algorithms.

The steepest-descent algorithm updates the coefficients in the following general form

$$\mbox{w}(k + 1) = \mbox{w}(k) - \mu {\mathbf{g}}_{\mbox{w}}(k)$$
(2.132)

where the above expression is equivalent to (1.6). It is worth noting that several alternative gradient-based algorithms replace g w (k) by an estimate \(\hat{{\mathbf{g}}}_{\bf{w}}(k)\), and they differ in the way the gradient vector is estimated. The true gradient expression is given in (2.91) and, as can be noted, it depends on the vector p and the matrix R, which are usually not available.

Substituting (2.91) in (2.132), we get

$$\mbox{w}(k + 1) = \mbox{w}(k) - 2\mu \bf{R}\bf{w}(k) + 2\mu \bf{p}$$
(2.133)

Now, some of the main properties related to the convergence behavior of the steepest-descent algorithm in stationary environment are described. First, an analysis is required to determine the influence of the convergence factor μ in the convergence behavior of the steepest-descent algorithm.

The error in the adaptive-filter coefficients when compared to the Wiener solution is defined as

$$\Delta \mbox{w}(k) = \mbox{w}(k) -{\mbox{w}}_{o}$$
(2.134)

The steepest-descent algorithm can then be described in an alternative way, that is:

$$\begin{array}{rcl} \Delta \mbox{w}(k + 1)& =& \Delta \mbox{w}(k) - 2\mu [\bf{R}\mbox{w}(k) -\bf{R}{\mbox{w}}_{o}] \\ & =& \Delta \mbox{w}(k) - 2\mu \bf{R}\Delta \mbox{w}(k) \\ & =& \left (\bf{I} - 2\mu \bf{R}\right )\Delta \mbox{w}(k) \end{array}$$
(2.135)

where the relation p = Rw o (see (2.92)) was employed. It can be shown from the above equation that

$$\Delta \mbox{w}(k + 1) = {(\bf{I} - 2\mu \bf{R})}^{k+1}\Delta \mbox{w}(0)$$
(2.136)

or

$$\mbox{w}(k + 1) ={ \mbox{w}}_{o} + {(\bf{I} - 2\mu \bf{R})}^{k+1}[\mbox{w}(0) -{\mbox{w}}_{ o}]$$
(2.137)

Premultiplying (2.135) by Q T, where Q is the unitary matrix that diagonalizes R through a similarity transformation, yields

$$\begin{array}{rcl}{ \bf{Q}}^{T}\Delta \mbox{w}(k + 1)& =& (\bf{I} - 2\mu {\bf{Q}}^{T}\bf{R}\bf{Q}){\bf{Q}}^{T}\Delta \mbox{w}(k) \\ & =& \mathbf{v}(k + 1) \\ & =& (\bf{I} - 2\mu {\Lambda })\mathbf{v}(k) \\ & =& \left [\begin{array}{cccc} 1 - 2\mu {\lambda }_{0} & 0 &\cdots & 0 \\ 0 &1 - 2\mu {\lambda }_{1} & & \vdots\\ \vdots & \vdots & \ddots &\vdots \\ 0 & 0 & &1 - 2\mu {\lambda }_{N} \end{array} \right ]\mathbf{v}(k)\end{array}$$
(2.138)

In the above equation, \(\mathbf{v}(k + 1) ={ \bf{Q}}^{T}\Delta \mbox{w}(k + 1)\) is the rotated coefficient-vector error. Using induction, (2.138) can be rewritten as

$$\begin{array}{rcl} \mathbf{v}(k + 1)& =& {(\bf{I} - 2\mu {\Lambda })}^{k+1}\mathbf{v}(0) \\ & =& \left [\begin{array}{cccc} {(1 - 2\mu {\lambda }_{0})}^{k+1} & 0 &\cdots & 0 \\ 0 &{(1 - 2\mu {\lambda }_{1})}^{k+1} & & \vdots\\ \vdots & \vdots & \ddots &\vdots \\ 0 & 0 & &{(1 - 2\mu {\lambda }_{N})}^{k+1} \end{array} \right ]\mathbf{v}(0)\end{array}$$
(2.139)

This equation shows that in order to guarantee the convergence of the coefficients, each element 1 − 2μλ i must have an absolute value less than one. As a consequence, the convergence factor of the steepest-descent algorithm must be chosen in the range

$$0 < \mu < \frac{1} {{\lambda }_{\mathrm{max}}}$$
(2.140)

where λmax is the largest eigenvalue of R. In this case, all the elements of the diagonal matrix in (2.139) tend to zero as k → ∞, resulting in v(k + 1) → 0 for large k.

The μ value in the above range guarantees that the coefficient vector approaches the optimum coefficient vector w o . It should be mentioned that if matrix R has large eigenvalue spread, the convergence speed of the coefficients will be primarily dependent on the value of the smallest eigenvalue. Note that the slowest decaying element in (2.139) is given by \({(1 - 2\mu {\lambda }_{\mathrm{min}})}^{k+1}\).

The MSE presents a transient behavior during the adaptation process that can be analyzed in a straightforward way if we employ the diagonalized version of R. Recalling from (2.125) that

$$\xi (k) = {\xi }_{\mathrm{min}} + \Delta {\mbox{w}}^{T}(k)\bf{R}\Delta \mbox{w}(k)$$
(2.141)

the MSE can then be simplified as follows:

$$\begin{array}{rcl} \xi (k)& =& {\xi }_{\mathrm{min}} + \Delta {\mbox{w}}^{T}(k)\bf{Q}{\Lambda }\ {\bf{Q}}^{T}\Delta \mbox{w}(k) \\ & =& {\xi }_{\mathrm{min}} +{ \mathbf{v}}^{T}(k){\Lambda }\ \mathbf{v}(k) \\ & =& {\xi }_{\mathrm{min}} + \sum\limits_{i=0}^{N}{\lambda }_{ i}{v}_{i}^{2}(k) \end{array}$$
(2.142)

If we apply the result of (2.139) in (2.142), it can be shown that the following relation results

$$\begin{array}{rcl} \xi (k)& =& {\xi }_{\mathrm{min}} +{ \mathbf{v}}^{T}(k - 1)(\bf{I} - 2\mu {\Lambda }){\Lambda }\ (\bf{I} - 2\mu {\Lambda })\mathbf{v}(k - 1) \\ & =& {\xi }_{\mathrm{min}} + \sum\limits_{i=0}^{N}{\lambda }_{ i}{(1 - 2\mu {\lambda }_{i})}^{2k}{v}_{ i}^{2}(0) \end{array}$$
(2.143)

The analyses presented in this section show that before the steepest-descent algorithm reaches the steady-state behavior, there is a transient period where the error is usually high and the coefficients are far from the Wiener solution. As can be seen from (2.139), in the case of the adaptive-filter coefficients, the convergence will follow (N + 1) geometrically decaying curves with ratios \({r}_{wi} = (1 - 2\mu {\lambda }_{i})\). Each of these curves can be approximated by an exponential envelope with time constant τ wi as follows [5]:

$${r}_{wi} ={ \mathrm{e}}^{ \frac{-1} {{\tau }_{wi}} } = 1 - \frac{1} {{\tau }_{wi}} + \frac{1} {2!{\tau }_{wi}^{2}} + \cdots $$
(2.144)

In general, r wi is slightly smaller than one, especially for the slowly decaying modes corresponding to small values of λ i and μ. Therefore,

$${r}_{wi} = (1 - 2\mu {\lambda }_{i}) \approx 1 - \frac{1} {{\tau }_{wi}}$$
(2.145)

then

$$\begin{array}{rcl}{ \tau }_{wi} \approx \frac{1} {2\mu {\lambda }_{i}}& & \\ \end{array}$$

for i = 0, 1, …, N.

For the convergence of the MSE, the range of values of μ is the same as that required to guarantee the convergence of the coefficients. In this case, due to the exponent 2k in (2.143), the geometrically decaying curves have ratios given by \({r}_{ei} = (1 - 4\mu {\lambda }_{i})\), which can be approximated by exponential envelopes with time constants given by

$${\tau }_{ei} \approx \frac{1} {4\mu {\lambda }_{i}}$$
(2.146)

for i = 0, 1, …, N, where it was considered that \(4{\mu }^{2}{\lambda }_{i}^{2} \ll 1\). In the convergence of both the error and the coefficients, the time required for convergence depends on the ratio of the eigenvalues of the input signal correlation matrix. Further discussions on convergence properties that apply to gradient-type algorithms can be found in Chap. 3.
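The behavior described above can be reproduced with a few lines of code. The sketch below (our own, using the same illustrative statistics as Example 2.3 ahead) runs the steepest-descent recursion with the true gradient, chooses μ well inside the range (2.140), and prints the approximate time constants 1 ∕ (2μλ i ).

```python
import numpy as np

# Illustrative statistics (the same values appear in Example 2.3 below)
R = np.array([[1.0, 0.4045],
              [0.4045, 1.0]])
p = np.array([0.0, 0.2939])
w_o = np.linalg.solve(R, p)

lam = np.linalg.eigvalsh(R)          # eigenvalues of R
mu = 0.1 / lam.max()                 # well inside the stability range (2.140)

w = np.array([-1.0, -2.0])           # same starting point as Example 2.3(b)
for k in range(500):
    g = -2 * p + 2 * R @ w           # true gradient (2.91)
    w = w - mu * g                   # steepest-descent update (2.132)

print(np.allclose(w, w_o, atol=1e-4))   # True: the coefficients converged to w_o
print(1.0 / (2 * mu * lam))             # approximate time constants tau_wi
```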

Example 2.3.

The matrix R and the vector p are known for a given experimental environment:

$$\begin{array}{rcl} \bf{R}& =& \left [\begin{array}{cc} 1 &0.4045\\ 0.4045 & 1\\ \end{array} \right ]\\ \end{array}$$
$$\begin{array}{rcl} \bf{p}& =&{ \left [0\:\:0.2939\right ]}^{T} \\ \end{array}$$
$$\begin{array}{rcl} E[{d}^{2}(k)] = 0.5& & \\ \end{array}$$
  1. (a)

    Deduce the equation for the MSE.

  2. (b)

    Choose a small value for μ, and starting the parameters at \({[-1\:\: - 2]}^{T}\) plot the convergence path of the steepest-descent algorithm in the MSE surface.

  3. (c)

    Repeat the previous item for the Newton algorithm starting at [0   − 2]T.

Solution.

  1. (a)

    The MSE function is given by

    $$\begin{array}{rcl} \xi & =& E[{d}^{2}(k)] - 2{\bf{w}}^{T}\bf{p} +{ \bf{w}}^{T}\bf{R}\bf{w} \\ & =& {\sigma }_{d}^{2} - 2[{w}_{ 1}\:\:{w}_{2}]\left [\begin{array}{c} 0\\ 0.2939\\ \end{array} \right ] + [{w}_{1}\:\:{w}_{2}]\left [\begin{array}{cc} 1 &0.4045\\ 0.4045 & 1\\ \end{array} \right ]\left [\begin{array}{c} {w}_{1} \\ {w}_{2}\\ \end{array} \right ]\\ \end{array}$$

    After performing the algebraic calculations, we obtain the following result

    $$\begin{array}{rcl} \xi & =& 0.5 + {w}_{1}^{2} + {w}_{ 2}^{2} + 0.8090{w}_{ 1}{w}_{2} - 0.5878{w}_{2} \\ \end{array}$$
  2. (b)

    The steepest-descent algorithm was applied to minimize the MSE using a convergence factor \(\mu = 0.1/{\lambda }_{\mathrm{max}}\), where λmax = 1.4045. The convergence path of the algorithm in the MSE surface is depicted in Fig. 2.8. As can be noted, the path followed by the algorithm first approaches the main axis (eigenvector) corresponding to the smaller eigenvalue, and then follows toward the minimum in a direction increasingly aligned with this main axis.

    Fig. 2.8
    figure 8

    Convergence path of the steepest-descent algorithm

  3. (c)

    The Newton algorithm was also applied to minimize the MSE using a convergence factor \(\mu = 0.1/{\lambda }_{\mathrm{max}}\). The convergence path of the Newton algorithm in the MSE surface is depicted in Fig. 2.9. The Newton algorithm follows a straight path to the minimum.

    Fig. 2.9
    figure 9

    Convergence path of the Newton algorithm

2.10 Applications Revisited

In this section, we give a brief introduction to the typical applications where the adaptive-filtering algorithms are required, including a discussion of where in the real world these applications are found. The main objective of this section is to illustrate how the adaptive-filtering algorithms, in general, and the ones presented in the book, in particular, are applied to solve practical problems. It should be noted that the detailed analysis of any particular application is beyond the scope of this book. Nevertheless, a number of specific references are given for the interested reader. The distinctive feature of each application is the way the adaptive filter input signal and the desired signal are chosen. Once these signals are determined, any known properties of them can be used to understand the expected behavior of the adaptive filter when attempting to minimize the chosen objective function (for example, the MSE, ξ).

2.10.1 System Identification

The typical setup of the system identification application is depicted in Fig. 2.10. A common input signal is applied to the unknown system and to the adaptive filter. Usually, the input signal is a wideband signal, in order to allow the adaptive filter to converge to a good model of the unknown system.

Fig. 2.10
figure 10

System identification

Assume the unknown system has impulse response given by h(k), for k = 0, 1, 2, 3, …, and zero for k < 0. The error signal is then given by

$$\begin{array}{rcl} e(k)& =& d(k) - y(k) \\ & =& \sum\limits_{l=0}^{\infty }h(l)x(k - l) -\sum\limits_{i=0}^{N}{w}_{ i}(k)x(k - i)\end{array}$$
(2.147)

where w i (k) are the coefficients of the adaptive filter.

Assuming that x(k) is a white noise, the MSE for a fixed w is given by

$$\begin{array}{rcl} \xi & =& E\{{[{\mathbf{h}}^{T}{\bf{x}}_{ \infty }(k) -{\bf{w}}^{T}{\bf{x}}_{ N+1}(k)]}^{2}\} \\ & =& E\left [{\mathbf{h}}^{T}{\bf{x}}_{ \infty }(k){\bf{x}}_{\infty }^{T}(k)\mathbf{h} - 2{\mathbf{h}}^{T}{\bf{x}}_{ \infty }(k){\bf{x}}_{N+1}^{T}(k)\bf{w} +{ \bf{w}}^{T}{\bf{x}}_{ N+1}(k){\bf{x}}_{N+1}^{T}(k)\bf{w}\right ] \\ & =& {\sigma }_{x}^{2}{ \sum \nolimits }_{i=0}^{\infty }{h}^{2}(i) - 2{\sigma }_{ x}^{2}{\mathbf{h}}^{T}\left [\begin{array}{c} {\bf{I}}_{N+1} \\ {\bf{0}}_{\infty \times (N+1)} \end{array} \right ]\bf{w} +{ \bf{w}}^{T}{\bf{R}}_{ N+1}\bf{w}\end{array}$$
(2.148)

where x (k) and x N + 1(k) are the input signal vector with infinite and finite lengths, respectively.

By calculating the derivative of ξ with respect to the coefficients of the adaptive filter, it follows that

$$\begin{array}{rcl}{ \bf{w}}_{o} ={ \mathbf{h}}_{N+1}& &\end{array}$$
(2.149)

where

$$\begin{array}{rcl}{ \mathbf{h}}_{N+1}^{T} ={ \mathbf{h}}^{T}\left [\begin{array}{c} {\bf{I}}_{N+1} \\ {\bf{0}}_{\infty \times (N+1)} \end{array} \right ]& &\end{array}$$
(2.150)

If the input signal is a white noise, the best model for the unknown system is a system whose impulse response coincides with the first N + 1 samples of the unknown system impulse response. In the cases where the impulse response of the unknown system is of finite length and the adaptive filter is of sufficient order (i.e., it has a sufficient number of parameters), the MSE becomes zero if there is no measurement noise (or channel noise). In practical applications the measurement noise is unavoidable, and if it is uncorrelated with the input signal, the expected value of the adaptive-filter coefficients will coincide with the unknown-system impulse response samples. The output error will of course be the measurement noise. We can observe that the measurement noise introduces a variance in the estimates of the unknown system parameters.
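The conclusion above can be checked numerically. In the sketch below (our own; the "unknown" system, filter order, and noise level are arbitrary), the Wiener solution obtained from time-averaged estimates of R and p matches the first N + 1 samples of the unknown impulse response, with the measurement noise showing up only as a small variance in the estimates.

```python
import numpy as np

rng = np.random.default_rng(3)

h = np.array([0.5, -0.8, 0.3, 0.1, -0.05])   # hypothetical unknown system
N = 2                                        # adaptive filter with N + 1 = 3 taps
K = 50000

x = rng.standard_normal(K)                   # white input with unit variance
d = np.convolve(x, h)[:K] + 0.05 * rng.standard_normal(K)   # desired signal with noise

# Input signal vectors x_{N+1}(k) stacked as rows; time averages estimate R and p
X = np.column_stack([np.concatenate((np.zeros(i), x[:K - i])) for i in range(N + 1)])
R_hat = X.T @ X / K
p_hat = X.T @ d / K

w_o = np.linalg.solve(R_hat, p_hat)
print(np.round(w_o, 3))                      # close to h[:N+1], cf. (2.149)
print(h[:N + 1])
```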

Some real world applications of the system identification scheme include modeling of multipath communication channels [27], control systems [28], seismic exploration [29], and cancellation of echo caused by hybrids in some communication systems [30–34], just to mention a few.

2.10.2 Signal Enhancement

In the signal enhancement application, the reference signal consists of a desired signal x(k) that is corrupted by an additive noise n 1(k). The input signal of the adaptive filter is a noise signal n 2(k) that is correlated with the interference signal n 1(k), but uncorrelated with x(k). Figure 2.11 illustrates the configuration of the signal enhancement application. In practice, this configuration is found in acoustic echo cancellation for auditoriums [35], hearing aids, noise cancellation in hydrophones [36], cancelling of power line interference in electrocardiography [28], and in other applications. The cancelling of echo caused by the hybrid in some communication systems can also be considered a signal enhancement problem [28].

Fig. 2.11
figure 11

Signal enhancement (n 1(k) and n 2(k) are noise signals correlated with each other)

In this application, the error signal is given by

$$e(k) = x(k) + {n}_{1}(k) -\sum\limits_{l=0}^{N}{w}_{ l}{n}_{2}(k - l) = x(k) + {n}_{1}(k) - y(k)$$
(2.151)

The resulting MSE is then given by

$$E[{e}^{2}(k)] = E[{x}^{2}(k)] + E\{{[{n}_{ 1}(k) - y(k)]}^{2}\}$$
(2.152)

where it was assumed that x(k) is uncorrelated with n 1(k) and n 2(k). The above equation shows that if the adaptive filter, having n 2(k) as the input signal, is able to perfectly predict the signal n 1(k), the minimum MSE is given by

$${\xi }_{\mathrm{min}} = E[{x}^{2}(k)]$$
(2.153)

where the error signal, in this situation, is the desired signal x(k).

The effectiveness of the signal enhancement scheme depends on the high correlation between n 1(k) and n 2(k). In some applications, it is useful to include a delay of L samples in the reference signal or in the input signal, such that their relative delay yields a maximum cross-correlation between y(k) and n 1(k), reducing the MSE. This delay provides a kind of synchronization between the signals involved. An example exploring this issue will be presented in the following chapters.

2.10.3 Signal Prediction

In the signal prediction application, the adaptive-filter input consists of a delayed version of the desired signal as illustrated in Fig. 2.12. The MSE is given by

$$\xi = E\{{[x(k) -{\bf{w}}^{T}\bf{x}(k - L)]}^{2}\}$$
(2.154)
Fig. 2.12
figure 12

Signal prediction

The minimization of the MSE leads to an FIR filter, whose coefficients are the elements of w. This filter is able to predict the present sample of the input signal using as information old samples such as \(x(k - L),\:x(k - L - 1),\ldots,\:x(k - L - N)\). The resulting FIR filter can then be considered a model for the signal x(k) when the MSE is small. The minimum MSE is given by

$${ \xi }_{\mathrm{min}} = r(0)-{\bf{w}}_{o}^{T}\left [\begin{array}{c} r(L) \\ r(L + 1)\\ \vdots \\ r(L + N) \end{array} \right ]$$
(2.155)

where w o is the optimum predictor coefficient vector and \(r(l) = E[x(k)x(k - l)]\) for a stationary process.

A typical predictor’s application is in linear prediction coding of speech signals [37], where the predictor’s task is to estimate the speech parameters. These parameters w are part of the coding information that is transmitted or stored along with other information inherent to the speech characteristics, such as pitch period, among others.

The adaptive signal predictor is also used for adaptive line enhancement (ALE), where the input signal is a narrowband signal (predictable) added to a wideband signal. After convergence, the predictor output will be an enhanced version of the narrowband signal.

Yet another application of the signal predictor is the suppression of narrowband interference in a wideband signal. The input signal, in this case, has the same general characteristics of the ALE. However, we are now interested in removing the narrowband interferer. For such an application, the output signal of interest is the error signal [35].

2.10.4 Channel Equalization

As can be seen from Fig. 2.13, channel equalization or inverse filtering consists of estimating a transfer function to compensate for the linear distortion caused by the channel. From another point of view, the objective is to force a prescribed dynamic behavior for the cascade of the channel (unknown system) and the adaptive filter, determined by the input signal. The first interpretation is more appropriate in communications, where the information is transmitted through dispersive channels [38, 33]. The second interpretation is appropriate for control applications, where the inverse filtering scheme generates control signals to be used in the unknown system [28].

Fig. 2.13
figure 13

Channel equalization

In the ideal situation, where n(k) = 0 and the equalizer has sufficient order, the error signal is zero if

$$W(z)H(z) = {z}^{-L}$$
(2.156)

where W(z) and H(z) are the equalizer and unknown system transfer functions, respectively. Therefore, the ideal equalizer has the following transfer function

$$W(z) = \frac{{z}^{-L}} {H(z)}$$
(2.157)

From the above equation, we can conclude that if H(z) is an IIR transfer function with nontrivial numerator and denominator polynomials, W(z) will also be IIR. If H(z) is an all-pole model, W(z) is FIR. If H(z) is an all-zero model, W(z) is an all-pole transfer function.

By applying the inverse \(\mathcal{Z}\)-transform to (2.156), we can conclude that the optimal equalizer impulse response convolved with the channel impulse response produces as a result an impulse. This means that for zero additional error in the channel, the output signal y(k) restores x(k − L) and, therefore, one can conclude that a deconvolution process took place.

The delay in the reference signal plays an important role in the equalization process. Without the delay, the desired signal is x(k), whereas the signal y(k) will be mainly influenced by old samples of the input signal, since the unknown system is usually causal. As a consequence, the equalizer should also perform the task of predicting x(k) simultaneously with the main task of equalizing the channel. The introduction of a delay alleviates the prediction task, leaving the equalizer free to invert the channel response. A rule of thumb for choosing the delay was proposed and analyzed in [28], where it was conjectured that the best delay should be close to half the time span of the equalizer. In practice, the reader should try different delays.

In the case the unknown system is not of minimum phase, i.e., its transfer function has zeros outside the unit circle of the \(\mathcal{Z}\) plane, the optimum equalizer is either stable and noncausal, or unstable and causal. Both solutions are unacceptable. The noncausal stable solution could be better approximated by a causal FIR filter when the delay is included in the desired signal. The delay forces a time shift in the ideal impulse response of the equalizer, allowing the time span, where most of the energy is concentrated, to be in the causal region.
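The role of the delay can be examined with a small simulation (our own sketch; the channel, noise level, and equalizer length are arbitrary choices): for each candidate delay L, a least-squares estimate of the Wiener equalizer is computed from a data record and the corresponding MSE is evaluated.

```python
import numpy as np

rng = np.random.default_rng(4)

h = np.array([0.3, 1.0, -0.4])      # illustrative channel (one zero outside the unit circle)
N = 7                               # equalizer with N + 1 = 8 taps
K = 50000

x = np.sign(rng.standard_normal(K))                         # transmitted binary symbols
r = np.convolve(x, h)[:K] + 0.05 * rng.standard_normal(K)   # received signal with noise

# Rows are the received-signal vectors [r(k) r(k-1) ... r(k-N)]
Xr = np.column_stack([np.concatenate((np.zeros(i), r[:K - i])) for i in range(N + 1)])

for L in range(N + 1):
    d = np.concatenate((np.zeros(L), x[:K - L]))    # desired signal x(k - L)
    w = np.linalg.lstsq(Xr, d, rcond=None)[0]       # least-squares (Wiener) equalizer
    mse = np.mean((d - Xr @ w) ** 2)
    print(f"delay L = {L}: MSE = {mse:.4f}")
# Intermediate delays usually give the lowest MSE, in line with the rule of thumb above.
```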

If channel noise is present and is uncorrelated with the channel’s input signal, the error signal and y(k) will be accordingly noisier. However, it should be noticed that the adaptive equalizer, in the process of reducing the MSE, disturbs the optimal solution by trying to reduce the effects of n(k). Therefore, in a noisy environment the equalizer transfer function is not exactly the inverse of H(z).

In practice, the noblest use of the adaptive equalizer is to compensate for the distortion caused by the transmission channel in a communication system. The main distortions caused by the channels are high attenuation and intersymbol interference (ISI). The ISI is generated when different frequency components of the transmitted signals arrive at different times at the receiver, a phenomenon caused by the nonlinear group delay of the channel [38]. For example, in a digital communication system, the time-dispersive channel extends a transmitted symbol beyond the time interval allotted to it, interfering with the past and future symbols. Under severe ISI, when a short symbol spacing is used, the number of symbols causing ISI is large.

The channel impulse response is a time spread sequence described by h(k) with the received signal being given by

$$re(k + J) = x(k)h(J) + \sum\limits_{l=-\infty,\:l\neq k}^{k+J}x(l)h(k + J - l) + n(k + J)$$
(2.158)

where J denotes the channel time delay (including the sampler phase). The first term of the above equation corresponds to the desired information, the second term is the interference from the symbols sent before and after x(k), and the third term accounts for the channel noise. Obviously only the neighboring symbols have significant influence on the second term of the above equation. The elements of the second term involving x(l), for l > k, are called pre-cursor ISI since they are caused by components of the data signal that reach the receiver before their cursor. On the other hand, the elements involving x(l), for l < k, are called post-cursor ISI.

In many situations, the ISI is reduced by employing an equalizer consisting of an adaptive FIR filter of appropriate length. The adaptive equalizer attempts to cancel the ISI in the presence of noise. In digital communication, a decision device is placed after the equalizer in order to identify the symbol at a given instant. The equalizer coefficients are updated in two distinct circumstances by employing different reference signals. During the equalizer training period, a previously chosen training signal is transmitted through the channel and a properly delayed version of this signal, that is prestored in the receiver end, is used as reference signal. The training signal is usually a pseudo-noise sequence long enough to allow the equalizer to compensate for the channel distortions. After convergence, the error between the adaptive-filter output and the decision device output is utilized to update the coefficients. The resulting scheme is the decision-directed adaptive equalizer. It should be mentioned that in some applications no training period is available. Usually, in this case, the decision-directed error is used all the time.

A more general equalizer scheme is the decision-feedback equalizer (DFE) illustrated in Fig. 2.14. The DFE is widely used in situations where the channel distortion is severe [38, 39]. The basic idea is to feed back, via a second FIR filter, the decisions made by the decision device that is applied to the equalized signal. The second FIR filter is preceded by a delay, otherwise there is a delay-free loop around the decision device. Assuming the decisions were correct, we are actually feeding back the symbols x(l), for l < k, of (2.158). The DFE is able to cancel the post-cursor ISI for a number of past symbols (depending on the order of the FIR feedback filter), leaving more freedom for the feedforward section to take care of the remaining terms of the ISI. Some known characteristics of the DFE are [38]:

  • The signals that are fed back are symbols, being noise free and allowing computational savings.

  • The noise enhancement is reduced, if compared with the feedforward-only equalizer.

  • Short time recovery when incorrect decisions are made.

  • Reduced sensitivity to sampling phase.

Fig. 2.14
figure 14

Decision-feedback equalizer

The DFE operation starts with a training period where a known sequence is transmitted through the channel, and the same sequence is used at the receiver as the desired signal. The delay introduced in the training signal is meant to compensate for the delay the transmitted signal faces when passing through the channel. During the training period the error signal, which consists of the difference between the delayed training signal and signal y(k), is minimized by adapting the coefficients of the forward and feedback filters. After this period, there is no training signal and the desired signal will consist of the decision device output signal. Assuming the decisions are correct, this blind way of performing the adaptation is the best solution to keep track of small changes in the channel behavior.

Example 2.4.

In this example we will verify the effectiveness of the Wiener solution in environments related to the applications of noise cancellation, prediction, equalization, and identification.

  1. (a)

    In a noise cancellation environment a sinusoid is corrupted by noise as follows

    $$\begin{array}{rcl} d(k) =\cos {\omega }_{0}k + {n}_{1}(k)& & \\ \end{array}$$

    with

    $$\begin{array}{rcl}{ n}_{1}(k) = -a{n}_{1}(k - 1) + n(k)& & \\ \end{array}$$

    where | a |  < 1 and n(k) is a zero-mean white noise with variance σ n 2 = 1. The input signal of the Wiener filter is described by

    $$\begin{array}{rcl}{ n}_{2}(k) = -b{n}_{2}(k - 1) + n(k)& & \\ \end{array}$$

    where | b |  < 1.

  2. (b)

    In a prediction case the input signal is modeled as

    $$\begin{array}{rcl} x(k) = -ax(k - 1) + n(k)& & \\ \end{array}$$

    with n(k) being a white noise with unit variance and | a |  < 1.

  3. (c)

    In an equalization problem a zero-mean white noise signal s(k) with variance c is transmitted through a channel with an AR model described by

    $$\begin{array}{rcl} \hat{x}(k) = -a\hat{x}(k - 1) + s(k)& & \\ \end{array}$$

    with | a |  < 1 and the received signal given by

    $$\begin{array}{rcl} x(k) =\hat{ x}(k) + n(k)& & \\ \end{array}$$

    where n(k) is a zero-mean white noise with variance d, uncorrelated with s(k).

  4. (d)

    In a system identification problem a zero-mean white noise signal x(k) with variance c is employed as the input signal to identify an AR system whose model is described by

    $$\begin{array}{rcl} v(k) = -av(k - 1) + x(k)& & \\ \end{array}$$

    where | a |  < 1 and the desired signal is given by

    $$\begin{array}{rcl} d(k) = v(k) + n(k)& & \\ \end{array}$$

    Repeat the problem if the system to be identified is an MA whose model is described by

    $$\begin{array}{rcl} v(k) = -ax(k - 1) + x(k)& & \\ \end{array}$$

For all these cases describe the Wiener solution with two coefficients and comment on the results.

Solution.

Some results used in the examples are briefly reviewed. A 2 ×2 matrix inversion is performed as

$$\begin{array}{rcl}{ \bf{R}}^{-1}& =& \frac{1} {{r}_{11}{r}_{22} - {r}_{12}{r}_{21}}\left [\begin{array}{cc} {r}_{22} & - {r}_{12} \\ - {r}_{21} & {r}_{11}\\ \end{array} \right ]\\ \end{array}$$

where r ij is the element of row i and column j of the matrix R. For two first-order AR modeled signals x(k) and v(k), whose poles are placed at − a and − b, respectively, and which are driven by the same unit-variance white-noise input, the cross-correlations are given byFootnote 9

$$\begin{array}{rcl} E[x(k)v(k - l)] = \frac{{(-a)}^{l}} {1 - ab}& & \\ \end{array}$$

for l > 0, and

$$\begin{array}{rcl} E[x(k)v(k - l)] = \frac{{(-b)}^{-l}} {1 - ab} & & \\ \end{array}$$

for l < 0. These cross-correlations are frequently required in the following solutions.

  1. (a)

    The input signal in this case is given by n 2(k), whereas the desired signal is given by d(k). The elements of the correlation matrix are computed as

    $$\begin{array}{rcl} E[{n}_{2}(k){n}_{2}(k - l)] = \frac{{(-b)}^{\vert l\vert }} {1 - {b}^{2}} & & \\ \end{array}$$

    The expression for the cross-correlation vector is given by

    $$\begin{array}{rcl} \bf{p}& =& \left [\begin{array}{c} E[(\cos {\omega }_{0}k + {n}_{1}(k)){n}_{2}(k)] \\ E[(\cos {\omega }_{0}k + {n}_{1}(k)){n}_{2}(k - 1)]\end{array} \right ] \\ & =& \left [\begin{array}{c} E[{n}_{1}(k){n}_{2}(k)] \\ E[{n}_{1}(k){n}_{2}(k - 1)]\end{array} \right ] \\ & =& \left [\begin{array}{c} \frac{1} {1-ab}{\sigma }_{n}^{2} \\ - \frac{a} {1-ab}{\sigma }_{n}^{2} \end{array} \right ] = \left [\begin{array}{c} \frac{1} {1-ab} \\ - \frac{a} {1-ab}\end{array} \right ]\\ \end{array}$$

    where in the last expression we substituted σ n 2 = 1.

    The coefficients corresponding to the Wiener solution are given by

    $$\begin{array}{rcl}{ \bf{w}}_{o}& =&{ \bf{R}}^{-1}\bf{p} = \left [\begin{array}{cc} 1& b \\ b&1\end{array} \right ]\left [\begin{array}{c} \frac{1} {1-ab} \\ - \frac{a} {1-ab}\end{array} \right ] = \left [\begin{array}{c} 1 \\ \frac{b-a} {1-ab}\end{array} \right ]\\ \end{array}$$

    The special case where a = 0 provides a quite illustrative solution. In this case

    $$\begin{array}{rcl}{ \bf{w}}_{o}& =& \left [\begin{array}{c} 1\\ b\end{array} \right ]\\ \end{array}$$

    such that the error signal is given by

    $$\begin{array}{rcl} e(k)& =& d(k) - y(k) =\cos {\omega }_{0}k + n(k) -{\bf{w}}_{o}^{T}\left [\begin{array}{c} {n}_{2}(k) \\ {n}_{2}(k - 1)\end{array} \right ] \\ & =& \cos {\omega }_{0}k + n(k) - {n}_{2}(k) - b{n}_{2}(k - 1) \\ & =& \cos {\omega }_{0}k + n(k) + b{n}_{2}(k - 1) - n(k) - b{n}_{2}(k - 1) =\cos {\omega }_{0}k \\ \end{array}$$

    As can be observed, the cosine signal is fully recovered, since the Wiener filter was able to reproduce n(k) and remove it from the desired signal.

  2. (b)

    In the prediction case the input signal is x(k) and the desired signal is x(k + L). Since

    $$\begin{array}{rcl} E[x(k)x(k - L)] = \frac{{(-a)}^{\vert L\vert }} {1 - {a}^{2}} & & \\ \end{array}$$

    the input signal correlation matrix is

    $$\begin{array}{rcl} \bf{R}& =& \left [\begin{array}{cc} E[{x}^{2}(k)] &E[x(k)x(k - 1)] \\ E[x(k)x(k - 1)]& E[{x}^{2}(k - 1)]\\ \end{array} \right ] \\ & =& \left [\begin{array}{cc} \frac{1} {1-{a}^{2}} & - \frac{a} {1-{a}^{2}} \\ - \frac{a} {1-{a}^{2}} & \frac{1} {1-{a}^{2}}\end{array} \right ]\\ \end{array}$$

    Vector p is described by

    $$\begin{array}{rcl} \bf{p}& =& \left [\begin{array}{c} E[x(k + L)x(k)]\\ E[x(k + L)x(k - 1)] \end{array} \right ] = \left [\begin{array}{c} \frac{{(-a)}^{\vert L\vert }} {1-{a}^{2}} \\ \frac{{(-a)}^{\vert L+1\vert }} {1-{a}^{2}}\end{array} \right ]\\ \end{array}$$

    The expression for the optimal coefficient vector is easily derived.

    $$\begin{array}{rcl}{ \bf{w}}_{o}& =&{ \bf{R}}^{-1}\bf{p} \\ & =& (1 - {a}^{2})\left [\begin{array}{cc} \frac{1} {1-{a}^{2}} & \frac{a} {1-{a}^{2}} \\ \frac{a} {1-{a}^{2}} & \frac{1} {1-{a}^{2}}\end{array} \right ]\left [\begin{array}{c} \frac{{(-a)}^{L}} {1-{a}^{2}} \\ \frac{{(-a)}^{L+1}} {1-{a}^{2}}\end{array} \right ] \\ & =& \left [\begin{array}{c} {(-a)}^{L} \\ 0\end{array} \right ] \\ \end{array}$$

    where in the above equation the value of L is considered positive. The predictor result tells us that an estimate \(\hat{x}(k + L)\) of x(k + L) can be obtained as

    $$\begin{array}{rcl} \hat{x}(k + L) = {(-a)}^{L}x(k)& & \\ \end{array}$$

    According to our model for the signal x(k), the actual value of x(k + L) is

    $$\begin{array}{rcl} x(k + L) = {(-a)}^{L}x(k) +{ \sum \nolimits }_{i=0}^{L-1}{(-a)}^{i}n(k - i)& & \\ \end{array}$$

    The results show that if x(k) is the data observed at a given instant of time, the best estimate of x(k + L) in terms of x(k) is obtained by averaging out the noise as follows

    $$\begin{array}{rcl} \hat{x}(k + L) = {(-a)}^{L}x(k) + E\left [{\sum \nolimits }_{i=0}^{L-1}{(-a)}^{i}n(k - i)\right ] = {(-a)}^{L}x(k)& & \\ \end{array}$$

    since \(E[n(k - i)] = 0\).
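    The predictor can also be verified numerically. The sketch below assumes the illustrative values a = 0.6 and L = 2 (arbitrary choices, not taken from the text): it generates the AR signal, estimates R and p from samples, and confirms that the coefficients approach \([{(-a)}^{L}\;\;0]^{T}\).

```python
import numpy as np

# Numerical check of the L-step-ahead predictor.
# Assumed (illustrative) values: a = 0.6, L = 2, unit-variance Gaussian n(k).
rng = np.random.default_rng(1)
N, a, L = 200_000, 0.6, 2

n = rng.standard_normal(N)
x = np.zeros(N)
x[0] = n[0]
for k in range(1, N):                        # x(k) = -a x(k-1) + n(k)
    x[k] = -a * x[k - 1] + n[k]

X = np.stack([x[1:N - L], x[0:N - L - 1]])   # input vector [x(k), x(k-1)]^T
d = x[1 + L:N]                               # desired signal x(k + L)

R_hat = X @ X.T / X.shape[1]
p_hat = X @ d / X.shape[1]
w_hat = np.linalg.solve(R_hat, p_hat)
print(w_hat, [(-a) ** L, 0.0])               # w_hat approaches [(-a)^L, 0]^T
```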

  3. (c)

    In this equalization problem, matrix R is given by

    $$\begin{array}{rcl} \bf{R}& =& \left [\begin{array}{cc} E[{x}^{2}(k)] &E[x(k)x(k - 1)] \\ E[x(k)x(k - 1)]& E[{x}^{2}(k - 1)]\end{array} \right ] = \left [\begin{array}{cc} \frac{1} {1-{a}^{2}} c + d& - \frac{a} {1-{a}^{2}} c \\ - \frac{a} {1-{a}^{2}} c & \frac{1} {1-{a}^{2}} c + d\end{array} \right ]\\ \end{array}$$

    By choosing s(k − L) as the desired signal and recalling that it is white noise, uncorrelated with the other signals involved in the experiment, the cross-correlation vector between the input and desired signals is given by

    $$\begin{array}{rcl} \bf{p}& =& \left [\begin{array}{c} E[x(k)s(k - L)]\\ E[x(k - 1)s(k - L)] \end{array} \right ] = \left [\begin{array}{c} {(-1)}^{L}{a}^{L}c \\ {(-1)}^{L-1}{a}^{L-1}c\end{array} \right ]\\ \end{array}$$

    The coefficients of the underlying Wiener solution are given by

    $$\begin{array}{rcl}{ \bf{w}}_{o}& =&{ \bf{R}}^{-1}\bf{p} = \frac{1} { \frac{{c}^{2}} {1-{a}^{2}} + 2 \frac{dc} {1-{a}^{2}} + {d}^{2}}\left [\begin{array}{cc} \frac{1} {1-{a}^{2}} c + d& \frac{a} {1-{a}^{2}} c \\ \frac{a} {1-{a}^{2}} c & \frac{1} {1-{a}^{2}} c + d\end{array} \right ]\left [\begin{array}{c} {(-1)}^{L}{a}^{L}c \\ {(-1)}^{L-1}{a}^{L-1}c\end{array} \right ] \\ & =& \frac{{(-1)}^{L}{a}^{L}c} { \frac{{c}^{2}} {1-{a}^{2}} + 2 \frac{cd} {1-{a}^{2}} + {d}^{2}}\left [\begin{array}{c} \frac{c} {1-{a}^{2}} + d - \frac{c} {1-{a}^{2}} \\ \frac{ac} {1-{a}^{2}} - {a}^{-1}d - \frac{{a}^{-1}c} {1-{a}^{2}}\end{array} \right ] \\ & =& \frac{{(-1)}^{L}{a}^{L}c} { \frac{{c}^{2}} {1-{a}^{2}} + 2 \frac{cd} {1-{a}^{2}} + {d}^{2}}\left [\begin{array}{c} d\\ - {a}^{-1 } d- {a}^{-1}c\end{array} \right ]\qquad \quad \\ \end{array}$$

    If there is no additional noise, i.e., d = 0, the above result becomes

    $$\begin{array}{rcl}{ \bf{w}}_{o}& =& \left [\begin{array}{c} 0\\ {(-1)}^{L-1 } {a}^{L-1}(1 - {a}^{2})\end{array} \right ]\\ \end{array}$$

    that is, the Wiener solution is just correcting the gain of the previously received component of the input signal, namely x(k − 1), while not using its most recent component x(k). This happens because the desired signal s(k − L) at instant k has a defined correlation with any previously received symbol. On the other hand, if the signal s(k) is a colored noise, the Wiener filter would have a nonzero first coefficient even in a noiseless environment. When environmental noise is present, the solution seeks a balance between modeling the desired signal and limiting the noise amplification.
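    The algebra above can be double-checked numerically. The sketch below uses the assumed values a = 0.4, c = 1, d = 0.2, and L = 3 (arbitrary choices): it compares the closed-form expression with the direct computation of \({\bf{R}}^{-1}\bf{p}\), and also reproduces the d = 0 special case.

```python
import numpy as np

# Numerical check of the closed-form equalizer solution derived above.
# Assumed (illustrative) values: a = 0.4, c = 1.0, d = 0.2, L = 3.
a, c, d, L = 0.4, 1.0, 0.2, 3
alpha = c / (1 - a**2)

R = np.array([[alpha + d, -a * alpha],
              [-a * alpha, alpha + d]])
p = np.array([(-1)**L * a**L * c,
              (-1)**(L - 1) * a**(L - 1) * c])

w_direct = np.linalg.solve(R, p)             # R^{-1} p computed directly

den = c**2 / (1 - a**2) + 2 * c * d / (1 - a**2) + d**2
w_closed = ((-1)**L * a**L * c / den) * np.array([d, -d / a - c / a])
print(np.allclose(w_direct, w_closed))       # True: both expressions agree

# d = 0 special case: second coefficient equals (-1)^(L-1) a^(L-1) (1 - a^2)
R0 = np.array([[alpha, -a * alpha], [-a * alpha, alpha]])
print(np.linalg.solve(R0, p), (-1)**(L - 1) * a**(L - 1) * (1 - a**2))
```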

  4. (d)

    In the system identification example the input signal correlation matrix is given by

    $$\begin{array}{rcl} \bf{R}& =& \left [\begin{array}{cc} c&0\\ 0 & c\end{array} \right ].\end{array}$$

    With the desired signal d(k), the cross-correlation vector is described as

    $$\begin{array}{rcl} \bf{p}& =& \left [\begin{array}{c} E[x(k)d(k)]\\ E[x(k - 1)d(k)] \\ \end{array} \right ] = \left [\begin{array}{c} c\\ - ca\end{array} \right ]\\ \end{array}$$

    The coefficients of the underlying Wiener solution are given by

    $$\begin{array}{rcl}{ \bf{w}}_{o}& =&{ \bf{R}}^{-1}\bf{p} = \left [\begin{array}{cc} \frac{1} {c} & 0 \\ 0 &\frac{1} {c}\end{array} \right ]\left [\begin{array}{c} c\\ - ca\end{array} \right ] = \left [\begin{array}{c} 1\\ -a\end{array} \right ]\\ \end{array}$$

    Note that this solution represents the best way a first-order FIR model can approximate an IIR model, since

    $$\begin{array}{rcl}{ W}_{o}(z) = 1 - a{z}^{-1}& & \\ \end{array}$$

    and

    $$\begin{array}{rcl} \frac{1} {1 + a{z}^{-1}} = 1 - a{z}^{-1} + {a}^{2}{z}^{-2} - {a}^{3}{z}^{-3} + \cdots & & \\ \end{array}$$

    On the other hand, if the unknown model is instead the FIR model described by \(v(k) = -ax(k - 1) + x(k)\), the Wiener solution remains the same and corresponds exactly to the unknown system model.
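    A quick numerical confirmation of this example is sketched below, assuming a = 0.3 and c = 2 (arbitrary values): the unknown IIR system \(\frac{1}{1 + a{z}^{-1}}\) is driven by white noise of variance c, and the two-tap Wiener solution estimated from the data approaches \([1\;\;-a]^{T}\).

```python
import numpy as np

# Numerical check of the system-identification example: a two-tap FIR Wiener
# filter approximating the IIR system 1/(1 + a z^{-1}).
# Assumed (illustrative) values: a = 0.3, c = 2.0.
rng = np.random.default_rng(2)
N, a, c = 200_000, 0.3, 2.0

x = np.sqrt(c) * rng.standard_normal(N)      # white input with variance c
d = np.zeros(N)
d[0] = x[0]
for k in range(1, N):                        # d(k) = -a d(k-1) + x(k)
    d[k] = -a * d[k - 1] + x[k]

X = np.stack([x[1:], x[:-1]])                # input vector [x(k), x(k-1)]^T
R_hat = X @ X.T / (N - 1)
p_hat = X @ d[1:] / (N - 1)
print(np.linalg.solve(R_hat, p_hat))         # approaches [1, -a]^T = [1, -0.3]^T
```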

    In all these examples, the environmental signals are considered WSS and their statistics are assumed known. In a practical situation, not only might the statistics be unknown, but the environments are usually nonstationary as well. In such situations, adaptive filters come into play, since their coefficients vary with time according to signals measured from the environment.

2.10.5 Digital Communication System

For illustration, a general digital communication scheme over a channel consisting of a subscriber line (telephone line, for example) is shown in Fig. 2.15. At either end, the input signal is first coded and conditioned by a transmit filter. This filter shapes the pulse and band-limits the signal that is actually transmitted. The signal then crosses the hybrid to travel through a dual duplex channel. The hybrid is an impedance bridge used to transfer the transmitted signal into the channel with minimal leakage to the near-end receiver. The imperfections of the hybrid cause echo that should be properly cancelled.

Fig. 2.15 General digital communication transceiver

In the channel, the signal is corrupted by white noise and crosstalk (leakage of signals being transmitted by other subscribers). After crossing the channel and the far-end hybrid, the signal is filtered by the receive filter, which attenuates high-frequency noise and also acts as an antialiasing filter. Subsequently, we have a joint DFE and echo canceller, where the forward-filter and echo-canceller outputs are subtracted. The result, after subtracting the decision-feedback output, is applied to the decision device. After passing through the decision device, the symbol is decoded.

Other schemes for data transmission over subscriber lines exist [33]. The one shown here is for illustration purposes, having as a special feature the joint equalizer and echo-canceller strategy. The digital subscriber line (DSL) structure shown here has been used in integrated services digital network (ISDN) basic access, which allows a data rate of 144 kbit/s [33]. Also, a similar scheme is employed in the high-bit-rate digital subscriber line (HDSL) [32, 40] that operates over short and conditioned loops [41, 42]. The latter system belongs to a broad class of digital subscriber lines collectively known as xDSL.

In wireless communications, the information is transported by propagating electromagnetic energy through the air. The electromagnetic energy is radiated to the propagation medium via an antenna. In order to operate wireless transmissions, the service provider requires authorization from government regulators to use a radio bandwidth. The demand for wireless data services is more than doubling each year, leading to a foreseeable spectrum shortage in the years to come. As a consequence, every effort to maximize the spectrum usage is highly desirable, and adaptive-filtering techniques certainly play an important role in achieving this goal. Several examples in the book illustrate how adaptive filters are employed in communication systems, so that readers can understand these applications and go on to try new ones they envision.

2.11 Concluding Remarks

In this chapter, we described some of the concepts underlying adaptive-filtering theory. The material presented here forms the basis for understanding the behavior of most adaptive-filtering algorithms in a practical implementation. The basic concept of the MSE-surface searching algorithms was briefly reviewed, serving as a starting point for the development of a number of practical adaptive-filtering algorithms to be presented in the following chapters. We illustrated through several examples the expected Wiener solutions in a number of distinct situations. In addition, we presented the basic concepts of the linearly constrained Wiener filter required in array signal processing. The theory and practice of adaptive signal processing is also the main subject of some excellent books such as [28, 43–51].

2.12 Problems

  1. 1.

    Suppose the input signal vector is composed of a delay line with a single input signal; compute the correlation matrix for the following input signals:

    1. (a)
      $$x(k) =\sin \left (\frac{\pi } {6} k\right ) +\cos \left (\frac{\pi } {4} k\right ) + n(k)$$
    2. (b)
      $$x(k) = a{n}_{1}(k)\cos \left ({\omega }_{0}k\right ) + {n}_{2}(k)$$
    3. (c)
      $$x(k) = a{n}_{1}(k)\sin \left ({\omega }_{0}k + {n}_{2}(k)\right )$$
    4. (d)
      $$x(k) = -{a}_{1}x(k - 1) - {a}_{2}x(k - 2) + n(k)$$
    5. (e)
      $$x(k) = \sum\limits_{i=0}^{4}0.25n(k - i)$$
    6. (f)
      $$x(k) = an(k){\mathrm{e}}^{j{\omega }_{0}k}$$

    In all cases, n(k), \({n}_{1}(k)\), and \({n}_{2}(k)\) are white-noise processes with zero mean and variances \({\sigma }_{n}^{2}\), \({\sigma }_{{n}_{1}}^{2}\), and \({\sigma }_{{n}_{2}}^{2}\), respectively. These random signals are considered independent.

  2. 2.

    Consider two complex random processes represented by x(k) and y(k).

    1. (a)

      Derive \({\sigma }_{xy}^{2}(k,l) = E[(x(k) - {m}_{x}(k))(y(l) - {m}_{y}(l))]\) as a function of \({r}_{xy}(k,l)\), \({m}_{x}(k)\), and \({m}_{y}(l)\).

    2. (b)

      Repeat (a) if x(k) and y(k) are jointly WSS.

    3. (c)

      If x(k) and y(k) are orthogonal, under which conditions are they also uncorrelated?

  3. 3.

    For the correlation matrices given below, calculate their eigenvalues, eigenvectors, and condition numbers.

    1. (a)
      $$\begin{array}{rcl} \bf{R} = \frac{1} {4}\left [\begin{array}{cccc} 4&3&2&1\\ 3 &4 &3 &2 \\ 2&3&4&3\\ 1 &2 &3 &4\\ \end{array} \right ]& & \\ \end{array}$$
    2. (b)
      $$\begin{array}{rcl} \bf{R} = \left [\begin{array}{cccc} 1 & 0.95 &0.9025&0.857375\\ 0.95 & 1 & 0.95 & 0.9025 \\ 0.9025 & 0.95 & 1 & 0.95\\ 0.857375 &0.9025 & 0.95 & 1\\ \end{array} \right ]& & \\ \end{array}$$
    3. (c)
      $$\begin{array}{rcl} \bf{R} = 50{\sigma }_{n}^{2}\left [\begin{array}{cccc} 1 &0.9899& 0.98 & 0.970\\ 0.9899 & 1 &0.9899 & 0.98 \\ 0.98 &0.9899& 1 &0.9899\\ 0.970 & 0.98 &0.9899 & 1\\ \end{array} \right ]& & \\ \end{array}$$
    4. (d)
      $$\begin{array}{rcl} \bf{R} = \left [\begin{array}{cccc} 1 & 0.5 &0.25&0.125\\ 0.5 & 1 & 0.5 & 0.25 \\ 0.25 & 0.5 & 1 & 0.5\\ 0.125 &0.25 & 0.5 & 1\\ \end{array} \right ]& & \\ \end{array}$$
  4. 4.

    For the correlation matrix given below, calculate its eigenvalues and eigenvectors, and form the matrix Q.

    $$\begin{array}{rcl} \bf{R} = \frac{1} {4}\left [\begin{array}{cc} {a}_{1} & {a}_{2} \\ {a}_{2} & {a}_{1}\\ \end{array} \right ]& & \\ \end{array}$$
  5. 5.

    The input signal of a second-order adaptive filter is described by

    $$\begin{array}{rcl} x(k) = {\alpha }_{1}{x}_{1}(k) + {\alpha }_{2}{x}_{2}(k)& & \\ \end{array}$$

    where \({x}_{1}(k)\) and \({x}_{2}(k)\) are first-order AR processes, uncorrelated with each other, and both have unit variance. These signals are generated by applying distinct white noises to first-order filters whose poles are placed at \(a\) and \(-b\), respectively.

    1. (a)

      Calculate the autocorrelation matrix of the input signal.

    2. (b)

      If the desired signal consists of \({\alpha }_{3}{x}_{2}(k)\), calculate the Wiener solution.

  6. 6.

    The input signal of a first-order adaptive filter is described by

    $$\begin{array}{rcl} x(k) = \sqrt{2}{x}_{1}(k) + {x}_{2}(k) + 2{x}_{3}(k)& & \\ \end{array}$$

    where \({x}_{1}(k)\) and \({x}_{2}(k)\) are first-order AR processes, uncorrelated with each other, and both have unit variance. These signals are generated by applying distinct white noises to first-order filters whose poles are placed at \(-0.5\) and \(\frac{\sqrt{2}}{2}\), respectively. The signal \({x}_{3}(k)\) is a white noise with unit variance, uncorrelated with \({x}_{1}(k)\) and \({x}_{2}(k)\).

    1. (a)

      Calculate the autocorrelation matrix of the input signal.

    2. (b)

      If the desired signal consists of \(\frac{1} {2}{x}_{3}(k)\), calculate the Wiener solution.

  7. 7.

    Repeat the previous problem if the signal \({x}_{3}(k)\) is exactly the white noise that generated \({x}_{2}(k)\).

  8. 8.

    In a prediction case a sinusoid is corrupted by noise as follows

    $$\begin{array}{rcl} x(k) =\cos {\omega }_{0}k + {n}_{1}(k)& & \\ \end{array}$$

    with

    $$\begin{array}{rcl}{ n}_{1}(k) = -a{n}_{1}(k - 1) + n(k)& & \\ \end{array}$$

    where \(\vert a\vert < 1\). For this case, describe the Wiener solution with two coefficients and comment on the results.

  9. 9.

    Generate the ARMA processes x(k) described below. Calculate the variance of the output signal and the autocorrelation for lags 1 and 2. In all cases, n(k) is zero-mean Gaussian white noise with variance 0.1.

    1. (a)
      $$\begin{array}{rcl} x(k)& =& 1.9368x(k - 1) - 0.9519x(k - 2) + n(k) \\ & & -1.8894n(k - 1) + n(k - 2) \\ \end{array}$$
    2. (b)
      $$\begin{array}{rcl} x(k)& =& -1.9368x(k - 1) - 0.9519x(k - 2) + n(k) \\ & & +1.8894n(k - 1) + n(k - 2) \\ \end{array}$$

    Hint: For white-noise generation consult, for example, [15, 16].
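    One possible way to generate such a process is to filter Gaussian white noise with the corresponding ARMA transfer function, as in the sketch below for case (a); this is only a sketch of one option and does not rely on the specific routines of [15, 16].

```python
import numpy as np
from scipy.signal import lfilter

# Sketch: generate process (a) by filtering Gaussian white noise.
# The noise variance 0.1 comes from the problem statement; the rest is illustrative.
rng = np.random.default_rng(3)
N = 100_000
n = np.sqrt(0.1) * rng.standard_normal(N)    # white noise, variance 0.1

# x(k) = 1.9368 x(k-1) - 0.9519 x(k-2) + n(k) - 1.8894 n(k-1) + n(k-2)
b = [1.0, -1.8894, 1.0]                      # numerator (MA part)
a = [1.0, -1.9368, 0.9519]                   # denominator (AR part)
x = lfilter(b, a, n)[1000:]                  # discard the initial transient

def r(lag):
    # sample autocorrelation of the (zero-mean) sequence x at a given lag
    return np.mean(x[lag:] * x[:len(x) - lag])

print(r(0), r(1), r(2))                      # variance and autocorrelations at lags 1 and 2
```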

  10. 10.

    Generate the AR processes x(k) described below. Calculate the variance of the output signal and the autocorrelation for lags 1 and 2. In all cases, n(k) is zero-mean Gaussian white noise with variance 0.05.

    1. (a)
      $$x(k) = -0.8987x(k - 1) - 0.9018x(k - 2) + n(k)$$
    2. (b)
      $$x(k) = 0.057x(k - 1) + 0.889x(k - 2) + n(k)$$
  11. 11.

    Generate the MA processes x(k) described below. Calculate the variance of the output signal and the autocovariance matrix. In all cases, n(k) is zero-mean Gaussian white noise with variance 1.

    1. (a)
      $$\begin{array}{rcl} x(k)& =& 0.0935n(k) + 0.3027n(k - 1) + 0.4n(k - 2) \\ & & +\ 0.3027n(k - 4) + 0.0935n(k - 5) \\ \end{array}$$
    2. (b)
      $$x(k) = n(k) - n(k - 1) + n(k - 2) - n(k - 4) + n(k - 5)$$
    3. (c)
      $$x(k) = n(k) + 2n(k - 1) + 3n(k - 2) + 2n(k - 4) + n(k - 5)$$
  12. 12.

    Show that a process generated by adding two AR processes is in general an ARMA process.

  13. 13.

    Determine if the following processes are mean ergodic:

    1. (a)
      $$x(k) = a{n}_{1}(k)\cos ({\omega }_{0}k) + {n}_{2}(k)$$
    2. (b)
      $$x(k) = a{n}_{1}(k)\sin ({\omega }_{0}k + {n}_{2}(k))$$
    3. (c)
      $$x(k) = an(k){\mathrm{e}}^{2j{\omega }_{0}k}$$

      In all cases, n(k), \({n}_{1}(k)\), and \({n}_{2}(k)\) are white-noise processes with zero mean and variances \({\sigma }_{n}^{2}\), \({\sigma }_{{n}_{1}}^{2}\), and \({\sigma }_{{n}_{2}}^{2}\), respectively. These random signals are considered independent.

  14. 14.

    Show that the minimum (maximum) value of (2.69) occurs when \({w}_{i} = 0\) for \(i\neq j\) and \({\lambda }_{j}\) is the smallest (largest) eigenvalue, respectively.

  15. 15.

    Suppose the matrix R and the vector p are known for a given experimental environment. Compute the Wiener solution for the following cases:

    1. (a)
      $$\begin{array}{rcl} \bf{R} = \frac{1} {4}\left [\begin{array}{cccc} 4&3&2&1\\ 3 &4 &3 &2 \\ 2&3&4&3\\ 1 &2 &3 &4\\ \end{array} \right ]& & \\ \end{array}$$
      $$\begin{array}{rcl} \bf{p} ={ \left [\frac{1} {2}\:\:\frac{3} {8}\:\:\frac{2} {8}\:\:\frac{1} {8}\right ]}^{T}& & \\ \end{array}$$
    2. (b)
      $$\begin{array}{rcl} \bf{R} = \left [\begin{array}{cccc} 1 & 0.8 &0.64&0.512\\ 0.8 & 1 & 0.8 & 0.64 \\ 0.64 & 0.8 & 1 & 0.8\\ 0.512 &0.64 & 0.8 & 1\\ \end{array} \right ]& & \\ \end{array}$$
      $$\begin{array}{rcl} \bf{p} = \frac{1} {4}{\left [0.4096\:\:0.512\:\:0.64\:\:0.8\right ]}^{T}& & \\ \end{array}$$
    3. (c)
      $$\begin{array}{rcl} \bf{R} = \frac{1} {3}\left [\begin{array}{ccc} 3 & - 2& 1\\ - 2 & 3 & -2 \\ 1 & - 2& 3\\ \end{array} \right ]& & \\ \bf{p} ={ \left [-2\:\:1\:\: -\frac{1} {2}\right ]}^{T}& & \\ \end{array}$$
  16. 16.

    For the environments described in the previous problem, derive the updating formula for the steepest-descent method. Considering that the adaptive-filter coefficients are initially zero, calculate their values for the first ten iterations.
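    For reference, the sketch below shows how the resulting recursion can be iterated numerically for case (a); the step size μ = 0.05 is an assumed value (any step size ensuring convergence would do), and the gradient of the MSE surface at w(k) is 2(Rw(k) − p).

```python
import numpy as np

# Sketch of the steepest-descent recursion for case (a), starting from w(0) = 0.
# The step size mu = 0.05 is an assumed value, small enough for convergence here.
R = 0.25 * np.array([[4, 3, 2, 1],
                     [3, 4, 3, 2],
                     [2, 3, 4, 3],
                     [1, 2, 3, 4]], dtype=float)
p = np.array([1 / 2, 3 / 8, 2 / 8, 1 / 8])

mu = 0.05
w = np.zeros(4)
for k in range(10):
    grad = 2 * (R @ w - p)        # gradient of the MSE surface at w(k)
    w = w - mu * grad             # w(k+1) = w(k) - mu * gradient
    print(k + 1, w)

print("Wiener solution:", np.linalg.solve(R, p))
```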

  17. 17.

    Repeat the previous problem using the Newton method.

  18. 18.

    Calculate the spectral decomposition for the matrices R of Problem 15.

  19. 19.

    Calculate the minimum MSE for the examples of Problem 15 considering that the variance of the reference signal is given by \({\sigma }_{d}^{2}\).

  20. 20.

    Derive (2.112).

  21. 21.

    Derive the constraint matrix C and the gain vector f that impose the condition of linear phase onto the linearly constrained Wiener filter.

  22. 22.

    Show that the optimal solutions of the LCMV filter and the GSC filter with minimum norm are equivalent and related according to \({\bf{w}}_{\mathrm{LCMV}} = \bf{T}{\bf{w}}_{\mathrm{GSC}}\), where \(\bf{T} = [\bf{C}\;\;\bf{B}]\) is a full-rank transformation matrix with \({\bf{C}}^{T}\bf{B} = \bf{0}\) and

    $$\begin{array}{rcl}{ \bf{w}}_{\mathrm{LCMV}} ={ \bf{R}}^{-1}\bf{C}{({\bf{C}}^{T}{\bf{R}}^{-1}\bf{C})}^{-1}\bf{f}& & \\ \end{array}$$

    and

    $$\begin{array}{rcl}{ \bf{w}}_{\mathrm{GSC}} = \left [\begin{array}{*{10}c} {({\bf{C}}^{T}\bf{C})}^{-1}\bf{f} \\ -{({\bf{B}}^{T}\bf{R}\bf{B})}^{-1}{\bf{B}}^{T}\bf{R}\bf{C}{({\bf{C}}^{T}\bf{C})}^{-1}\bf{f}\end{array} \right ]& & \\ \end{array}$$
  23. 23.

    Calculate the time constants of the MSE and of the coefficients for the examples of Problem 15 considering that the steepest-descent algorithm was employed.

  24. 24.

    For the examples of Problem 15, describe the equations for the MSE surface.

  25. 25.

    Using the spectral decomposition of a Hermitian matrix show that

    $${\bf{R}}^{ \frac{1} {N} } = \bf{Q}{{\Lambda }}^{ \frac{1} {N} }{\bf{Q}}^{H} = \sum\limits_{i=0}^{N}{\lambda }_{ i}^{ \frac{1} {N} }{\bf{q}}_{i}{\bf{q}}_{i}^{H}$$
  26. 26.

    Derive the complex steepest-descent algorithm.

  27. 27.

    Derive the Newton algorithm for complex signals.

  28. 28.

    In a signal enhancement application, assume that \({n}_{1}(k) = {n}_{2}(k) {_\ast} h(k)\), where h(k) represents the impulse response of an unknown system. Also, assume that some small leakage of the signal x(k), given by \(h^{\prime}(k) {_\ast} x(k)\), is added to the adaptive-filter input. Analyze the consequences of this phenomenon.

  29. 29.

    In the equalizer application, calculate the optimal equalizer transfer function when the channel noise is present.