
In the previous chapters, methods for detecting control points in two images of a scene and methods for determining the correspondence between the control points were discussed. In this chapter, robust methods that use the control-point correspondences to determine the parameters of a transformation function to register the images are discussed. Transformation functions for image registration will be discussed in the following chapter.

Inaccuracies in the coordinates of corresponding points can be managed if the inaccuracies have a normal distribution with a mean of zero, but the presence of even one incorrect correspondence can break down the parameter estimation process. When image features/descriptors are used to find the correspondence between control points in two images, noise, repeated patterns, and geometric and intensity differences between the images can produce some incorrect correspondences. Not knowing which correspondences are correct and which are not, the job of a robust estimator is to identify some or all of the correct correspondences and use their coordinates to determine the transformation parameters.

In the previous chapter, RANSAC, a robust estimator widely used in the computer vision community, was reviewed. In this chapter, mathematically well-known robust estimators that are not widely used in computer vision and image analysis applications are reviewed. As we will see, these estimators can often replace RANSAC and sometimes outperform it.

The general problem to be addressed in this chapter is as follows. Given n corresponding points in two images of a scene:

$$\bigl\{(x_i,y_i),(X_i,Y_i): i=1,\ldots,n\bigr\},$$
(8.1)

we would like to find the parameters of a transformation function with two components f x and f y that satisfy

$$\begin{array}{rcl}X_i &\approx& f_x(x_i,y_i),\\[3pt]Y_i &\approx& f_y(x_i,y_i),\end{array} \quad i=1,\ldots, n.$$
(8.2)

If the components of the transformation are independent of each other, their parameters can be determined separately. In such a situation, it is assumed that

$$\bigl\{(x_i,y_i,F_i):i=1,\ldots,n\bigr\} $$
(8.3)

is given and it is required to find the parameters of function f to satisfy

$$F_i\approx f(x_i,y_i), \quad i=1,\ldots,n. $$
(8.4)

By letting F i =X i , the estimated function will represent f x , and by letting F i =Y i , the estimated function will represent f y . If the two components of a transformation depend on each other, as do the components of a projective transformation, both components of the transformation are estimated simultaneously.

f can be considered a single-valued surface that approximates the 3-D points given by (8.3). If the points are on or near the model to be estimated, f will approximate the model closely. However, if some points are away from the model to be estimated, f may be quite different from the model. The role of a robust estimator is to find the model parameters accurately even in the presence of distant points (outliers).

We assume each component of the transformation to be determined can be represented by a linear function of its parameters. That is

$$f=\mathbf{x}^t\mathbf{a}, $$
(8.5)

where a={a 1,…,a m } are the m unknown parameters of the model and x is a vector with m components, each a function of x and y. For instance, when f represents a component of an affine transformation, we have

$$f=a_1x+a_2y+a_3, $$
(8.6)

and so x t=[x y 1] and a t=[a 1 a 2 a 3]. When f represents a quadratic function, we have

$$f=a_1x^2+a_2y^2+a_3xy+a_4x+a_5y+a_6, $$
(8.7)

and so x t=[x 2 y 2 xy x y 1] and a t=[a 1 a 2 a 3 a 4 a 5 a 6].

When the observations given by (8.3) are contaminated, the estimated parameters will contain errors. Substituting (8.5) into (8.4) and rewriting it to include errors at the observations, we obtain

$$F_i=\mathbf{x}_i^t\mathbf{a}+e_i, \quad i=1,\ldots,n, $$
(8.8)

where e i is the vertical distance of F i to the surface to be estimated at (x i ,y i ) as shown in Fig. 8.1. This is the estimated positional error in a component of the ith point in the sensed image. Not knowing which correspondences are correct and which ones are not, an estimator finds the model parameters in such a way as to minimize some measure of error between the given data and the estimated model.

Fig. 8.1

Linear parameter estimation using contaminated data

In the remainder of this chapter, first the ordinary least squares (OLS) estimation is described. OLS performs well when the errors have a normal distribution. When errors have a long-tailed distribution, often caused by outliers, it performs poorly. Next, robust estimators that reduce or eliminate the influence of outliers on estimated parameters are discussed.

To evaluate and compare the performances of various estimators, 100 control points detected in each of the coin images in Fig. 6.2 will be used. The coordinates of the points are shown in Table 8.1. The points in the original coin image (Fig. 6.2a) are used as the reference points and denoted by (x,y). The control points detected in the blurred, noisy, contrast-enhanced, rotated, and scaled versions of the coin image are considered sensed points and are denoted by (X,Y).

Table 8.1 The point sets used to evaluate the performances of various estimators. (x,y) denote the column and row numbers of control points in the reference image, and (X,Y) denote the column and row numbers of control points in a sensed image. A sensed point that is found to correctly correspond to a reference point is marked with a ‘+’. The remaining points represent outliers. A sensed point marked with a ‘−’ is a point that is incorrectly assigned to a reference point by the matching algorithm

Correspondence was established between the reference point set and each of the sensed point sets by a graph-based matching algorithm with a rather large distance tolerance (ε=10 pixels) to allow inaccurate and incorrect correspondences to enter the process. The correspondences established between each sensed point set and the reference point set are marked with a ‘+’ or a ‘−’ in Table 8.1. A ‘+’ indicates a correct correspondence, while a ‘−’ indicates an incorrect one.

The algorithm found 95, 98, 98, 96, and 78 correspondences between the coin image and its blurred, noisy, contrast-enhanced, rotated, and scaled versions, respectively. Among the obtained correspondences, only 66, 60, 68, 48, and 28 are correct. Due to the large distance tolerance used in matching, the process has picked up all of the correct correspondences (true positives), but it has also picked up a large number of incorrect correspondences (false positives).

Establishing correspondence between points by the closest-point criterion resulted in some reference points being assigned to two or more sensed points. Although multiple assignments are easy to detect and remove, by removing such assignments, we run the risk of eliminating some correct correspondences, something that we want to avoid. Therefore, we keep the contaminated correspondences found by our matching algorithm and use them to determine the parameters of the transformation between each sensed image and the reference image by various estimators. After finding the transformation parameters by a robust estimator, we will then separate the correct correspondences from the incorrect ones.

The parameters of the affine transformations truly relating the blurred, noisy, contrast-enhanced, rotated, and scaled images to the original image are listed in Table 8.2. Knowing the true transformation parameters between each sensed image and the reference image, we would like to see how accurately various estimators can find these parameters using the contaminated correspondences shown in Table 8.1.

Table 8.2 True linear transformation parameters between the blurred, noisy, contrast-enhanced, rotated, and scaled coin images and the original coin image

8.1 OLS Estimator

Letting x ij represent the jth element of x when evaluated at the ith data point, relation (8.8) can be written as

$$F_i=\sum_{j=1}^m x_{ij}a_j +e_i, \quad i=1,\ldots,n. $$
(8.9)

e i is positive when the given data point falls above the approximating surface, and negative when the point falls below the surface. Assuming the error at a data point is independent of the errors at other data points and the errors have a Gaussian distribution, the ordinary least-squares (OLS) estimator finds the parameters of the model by minimizing the sum of squared vertical distances between the data and the estimated surface:

$$R= \sum_{i=1}^n r_i^2, $$
(8.10)

where

$$r_i=F_i-\sum_{j=1}^mx_{ij}a_j.$$
(8.11)

Vertical distance or residual r i can be considered an estimate of the actual error e i at the ith point. If the components of a transformation depend on each other, the squared residual at the ith point will be

$$r_i^2=\Biggl(X_i-\sum_{j=1}^{m_x}x_{ij}a_j\Biggr)^2+\Biggl(Y_i-\sum_{j=1}^{m_y}x_{ij}b_j\Biggr)^2,$$
(8.12)

where {a j :j=1,…,m x } are the parameters describing the x-component of the transformation, and {b j :j=1,…,m y } are the parameters describing the y-component of the transformation. When the two components of a transformation function are interdependent, some parameters appear in both components. For instance, in the case of the projective transformation, we have

$$X = \frac{a_1 x + a_2 y + a_3}{a_7 x + a_8 y + 1}, $$
(8.13)
$$Y = \frac{a_4 x + a_5 y + a_6}{a_7 x + a_8 y + 1}, $$
(8.14)

or

$$ a_7 xX + a_8 yX + X = a_1 x + a_2 y + a_3 , $$
(8.15)
$$ a_7 xY + a_8 yY + Y = a_4 x + a_5 y + a_6 , $$
(8.16)

so the squared distance between the ith point and the transformation function will be

$$\begin{array}{rcl}r_i^2 &=& (a_7 x_i X_i + a_8 y_i X_i + X_i - a_1 x_i - a_2 y_i - a_3)^2\\[3pt]&&{}+\,(a_7 x_i Y_i + a_8 y_i Y_i + Y_i - a_4 x_i - a_5 y_i - a_6)^2.\end{array} $$
(8.17)

The linear parameters a 1,…,a 8 are estimated by minimizing the sum of such squared distances or residuals.
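Because (8.15) and (8.16) are linear in a 1,…,a 8, each correspondence contributes two linear equations, and the eight parameters can be found by stacking the 2n equations and solving the resulting overdetermined system by least squares, which minimizes the sum of squared distances of (8.17). The following Python/NumPy sketch illustrates one way to set this up; the function and variable names are illustrative, not taken from the text.

```python
import numpy as np

def fit_projective_ls(xy, XY):
    """Estimate projective parameters a1..a8 by stacking the linearized
    equations (8.15)-(8.16) for all n correspondences and solving the
    overdetermined system with least squares (i.e., minimizing (8.17)).

    xy, XY : (n, 2) arrays of reference and sensed coordinates."""
    x, y = xy[:, 0], xy[:, 1]
    X, Y = XY[:, 0], XY[:, 1]
    n = len(x)
    A = np.zeros((2 * n, 8))
    b = np.zeros(2 * n)
    # a1*x + a2*y + a3 - a7*x*X - a8*y*X = X
    A[0::2, 0], A[0::2, 1], A[0::2, 2] = x, y, 1.0
    A[0::2, 6], A[0::2, 7] = -x * X, -y * X
    b[0::2] = X
    # a4*x + a5*y + a6 - a7*x*Y - a8*y*Y = Y
    A[1::2, 3], A[1::2, 4], A[1::2, 5] = x, y, 1.0
    A[1::2, 6], A[1::2, 7] = -x * Y, -y * Y
    b[1::2] = Y
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a   # (a1, ..., a8)
```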

To find the parameters that minimize the sum of squared residuals R, the gradient of R is set to 0 and the obtained system of linear equations is solved. For example, a component of an affine transformation (m=3) is determined by solving

$$\begin{array}{rcl}\frac{\partial R}{\partial a_1} &=& -2\sum_{i=1}^n x_i\,(F_i - a_1 x_i - a_2 y_i - a_3) = 0,\\[6pt]\frac{\partial R}{\partial a_2} &=& -2\sum_{i=1}^n y_i\,(F_i - a_1 x_i - a_2 y_i - a_3) = 0,\\[6pt]\frac{\partial R}{\partial a_3} &=& -2\sum_{i=1}^n (F_i - a_1 x_i - a_2 y_i - a_3) = 0,\end{array} $$
(8.18)

which can be written as

$$\left(\begin{array}{ccc}\sum_{i=1}^n x_i^2 & \sum_{i=1}^n x_i y_i & \sum_{i=1}^n x_i\\[3pt]\sum_{i=1}^n x_i y_i & \sum_{i=1}^n y_i^2 & \sum_{i=1}^n y_i\\[3pt]\sum_{i=1}^n x_i & \sum_{i=1}^n y_i & n\end{array}\right)\left(\begin{array}{c}a_1\\a_2\\a_3\end{array}\right)=\left(\begin{array}{c}\sum_{i=1}^n x_i F_i\\[3pt]\sum_{i=1}^n y_i F_i\\[3pt]\sum_{i=1}^n F_i\end{array}\right). $$
(8.19)

In matrix form, this can be written as

$$\mathbf{A}^t\mathbf{A}\mathbf{X}=\mathbf{A}^t\mathbf{b}, $$
(8.20)

where A is an n×3 matrix with A i1=x i , A i2=y i , and A i3=1; \(\bf b\) is an n×1 array with b i =F i ; and X is a 3×1 array of unknowns. Generally, when f is a linear function of m parameters, A ij is the partial derivative of f with respect to the jth parameter evaluated at the ith point.

We see that (8.20) is the same as left multiplying both sides of equation

$$\mathbf{A}\mathbf{X}=\mathbf{b} $$
(8.21)

by A t, and (8.21) is an overdetermined system of equations that generally has no exact solution. Therefore, OLS finds the solution to this overdetermined system of linear equations in such a way that the sum of squared residuals at the data points is minimized.

If A t A has full rank m, the solution of (8.20) will be

$$\hat{\mathbf{X}}= \bigl(\mathbf{A}^t\mathbf{A}\bigr)^{-1}\mathbf{A}^t\mathbf{b}.$$
(8.22)

Matrix A †=(A t A)−1 A t is known as the pseudo-inverse of A [4, 22]. Therefore,

$$\hat{\mathbf{X}}=\mathbf{A}^\dag\mathbf{b}. $$
(8.23)
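As a concrete illustration, the following sketch fits one component of an affine transformation by OLS using the pseudo-inverse, as in (8.22) and (8.23); the function name and argument layout are assumptions made for the example.

```python
import numpy as np

def fit_affine_component_ols(xy, F):
    """Fit one affine component f = a1*x + a2*y + a3 by OLS, (8.22)-(8.23).

    xy : (n, 2) array of reference coordinates (x_i, y_i)
    F  : (n,) array of the X_i (or Y_i) values in the sensed image."""
    A = np.column_stack([xy[:, 0], xy[:, 1], np.ones(len(xy))])
    a_hat = np.linalg.pinv(A) @ F      # a_hat = (A^t A)^{-1} A^t b
    residuals = F - A @ a_hat
    return a_hat, residuals
```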

The OLS estimator was developed independently by Gauss and Legendre. Although Legendre published the idea in 1805 and Gauss published it in 1809, records show that Gauss had been using the method since 1795 [31]. It has been shown that if (1) the data represent random observations from a model with linear parameters, (2) the errors at the points have a normal distribution with a mean of zero, and (3) the variables are independent, then the parameters determined by OLS represent the best linear unbiased estimation (BLUE) of the model parameters [1]. Linear independence requires that the components of x be independent of each other; an example of dependence is x 2 and xy. This implies that when least squares is used to find the parameters of functions like (8.7), where x contains interdependent components, the obtained parameters may not be BLUE.

Comparing the linear model with m parameters estimated by OLS to the first m principal components about the sample mean (Sect. 8.11), we see that OLS finds the model parameters by minimizing the sum of squared distances of the points to the surface measured vertically, while the first m principal components of the same data minimize the sum of squared distances measured between the points and the surface in the direction normal to the surface. Although both minimize a sum of squared distances, OLS treats one dimension of the observations preferentially, while principal component analysis (PCA) treats all dimensions of the observations similarly.

In addition to treating one dimension of data preferentially, OLS lacks robustness. A single outlier can drastically change the estimated parameters. The notion of breakdown point ε*, introduced by Hampel [5], is the smallest fraction of outliers that can change the estimated parameters drastically. In the case of OLS, ε*=1/n.

Using the 95 points marked with ‘+’ and ‘−’ in Table 8.1 for the blurred image and the corresponding points in the original image, OLS estimated the six linear parameters shown in Table 8.3. The root-mean-squared error (RMSE) obtained at all correspondences and the RMSE obtained at the 66 correct correspondences are also shown. The estimated model parameters and RMSE measures between the noisy, contrast-enhanced, rotated, and scaled images and the original image are also shown in Table 8.3.

Table 8.3 Estimated parameters by OLS for the five data sets in Table 8.1. RMSE a indicates RMSE when using all correspondences (marked with a ‘+’ or a ‘−’) and RMSE c indicates RMSE when using only the correct correspondences (marked with a ‘+’). The last column shows computation time in seconds when using all correspondences on a Windows PC with a 2.2 GHz processor

Because the outliers are no farther than 10 pixels from the surface to be estimated, their adverse effect on the estimated parameters is limited. Since in image registration the user can control this distance tolerance, outliers that are very far from the surface model to be estimated can be excluded from the point correspondences. Therefore, although the correspondences represent contaminated data, the maximum error an incorrect correspondence can introduce into the estimation process can be controlled. Decreasing the distance tolerance too much, however, may eliminate some of the correct correspondences, something we want to avoid. Therefore, the distance tolerance should be large enough to retain all the correct correspondences but not so large as to introduce false correspondences that can irreparably damage the estimation process.

Having contaminated data of the kind shown in Table 8.1, we would like to identify estimators that can accurately estimate the parameters of an affine transformation model and produce as small an RMSE measure as possible.

Since points with smaller residuals are more likely to represent correct correspondences than points with larger residuals, one way to reduce the estimation error is to give lower weights to points that are farther from the estimated surface. This is discussed next.

8.2 WLS Estimator

The weighted least-squares (WLS) estimator gives lower weights to points with higher squared residuals. The weights are intended to reduce the influence of outliers that are far from the estimated model surface. It has been shown that OLS produces the best linear unbiased estimation of the model parameters if all residuals have the same variance [20]. It has also been shown that when the observations have different uncertainties or variances, the least-squares error is reached when the squared residuals are normalized by the reciprocals of the residual variances [2]. If \(\sigma_{i}^{2}\) is the variance of the ith observation, by letting w i =1/σ i we can normalize the residuals by replacing x i with w i x i and F i with w i F i . Therefore, letting \(A'_{ij}=A_{ij}w_{i}\) and \(b'_{i}=b_{i}w_{i}\), (8.20) converts to

$$\mathbf{A}'^t\mathbf{A}'\mathbf{X}=\mathbf{A}'^t\mathbf{b}', $$
(8.24)

producing the least squares solution

$$\mathbf{X}=\bigl(\mathbf{A}'^t\mathbf{A}'\bigr)^{-1}\mathbf{A}'^t\mathbf{b}'. $$
(8.25)

If the variances at the sample points are not known, w i is set inversely proportional to the magnitude of the residual at the ith observation. That is, if

$$r_i=F_i-\mathbf{x}_i^t\hat{\mathbf{a}}, \quad i=1,\ldots,n, $$
(8.26)

then

$$w_i={1\over{|r_i|+\varepsilon }}, \quad i=1,\ldots,n. $$
(8.27)

ε is a small number, such as 0.01, to avoid division by zero.

Since the weights depend on estimated errors at the points, better weights can be obtained by improving the estimated parameters. If (8.26) represents the residuals calculated using the model surface obtained by OLS, and the initial model is denoted by f 0(x), the residuals at the (k+1)st iteration can be estimated from the model obtained at the kth iteration:

$$r_i^{(k+1)}=F_i-f_k(\mathbf{x}_i), \quad i=1,\ldots,n. $$
(8.28)

The process of improving the weights and the process of improving the model parameters are interconnected. From the residuals, weights at the points are calculated, and using the weights, the model parameters are estimated. The residuals are then recalculated using the refined model, and the process is repeated until the sum of squared weighted residuals no longer decreases noticeably from one iteration to the next.
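The following sketch illustrates this iterative reweighting with the weights of (8.27) and, optionally, the cut-off weights of (8.29) introduced below; the function name, stopping rule, and iteration cap are illustrative choices.

```python
import numpy as np

def fit_wls(A, F, eps=0.01, r0=None, max_iter=50, tol=1e-8):
    """Iteratively reweighted least squares using the weights of (8.27);
    if r0 is given, residuals larger than r0 get zero weight as in (8.29).

    A : (n, m) design matrix whose rows are x_i^t
    F : (n,) observations."""
    a = np.linalg.pinv(A) @ F               # initial parameters by OLS
    prev = np.inf
    for _ in range(max_iter):
        r = F - A @ a                       # residuals under the current model
        w = 1.0 / (np.abs(r) + eps)         # weights of (8.27)
        if r0 is not None:
            w[np.abs(r) > r0] = 0.0         # cut-off weights of (8.29)
        Aw, Fw = A * w[:, None], F * w      # A'_ij = A_ij w_i,  b'_i = b_i w_i
        a = np.linalg.lstsq(Aw, Fw, rcond=None)[0]
        cost = np.sum((Fw - Aw @ a) ** 2)   # sum of squared weighted residuals
        if prev - cost < tol:
            break
        prev = cost
    return a
```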

Using the data in Table 8.1 and letting ε=0.01, WLS finds the model parameters shown in Table 8.4 between the blurred, noisy, contrast-enhanced, rotated, and scaled images and the original image. Only a few to several iterations were needed to obtain these parameters. The estimation errors obtained by WLS using the correct correspondences are lower than those obtained by OLS. Interestingly, the parameters and the errors obtained by OLS and WLS on the rotated data set are almost the same. Results obtained on contaminated data by WLS are not any better than those obtained by OLS.

Table 8.4 Estimated parameters by WLS for the five data sets in Table 8.1 and the RMSE measures

If some information about the uncertainties of the point correspondences is available, the initial weights can be calculated using that information. This enables estimating the initial model parameters by WLS rather than by OLS and achieving a more accurate initial model. For instance, if a point in each image has an associated feature vector, the distance between the feature vectors of the ith corresponding points can be used as |r i | in (8.27). The smaller the distance between the feature vectors of corresponding points, the more likely it is that the correspondence is correct and, thus, the smaller the correspondence uncertainty.

The main objective in WLS estimation is to provide a means to reduce the influence of outliers on the estimation process. Although weighting can reduce the influence of distant outliers on the estimated parameters, it does not eliminate their influence. To completely remove the influence of distant outliers on the estimated parameters, rather than using the weight function of (8.27), a weight function that cuts off observations farther than a certain distance from the estimated surface can be used. An example of a weight function with this characteristic is

$$w_i=\left\{\begin{array}{l@{\quad}l}{1\over{|r_i|+\varepsilon }} & |r_i|\le r_0,\\[6pt]0 & |r_i|>r_0,\end{array}\right. $$
(8.29)

where r 0 is the required distance threshold to identify and remove the distant outliers.

The WLS estimator with a cut-off of r 0=2 pixels and ε=0.01 produced the model parameters shown in Table 8.5. The errors when using the correct correspondences are either the same as or only slightly lower than those found by the WLS estimator without a cut-off threshold. Removing points with larger residuals does not change the results significantly for this contaminated data. Because the residuals initially estimated by OLS contain errors, removing points with high residuals or weighting them lower does not appreciably change the distribution of the residuals, and so OLS and WLS, with or without a cut-off threshold distance, produce nearly the same parameters in this example.

Table 8.5 Estimated parameters by the weighted least squares with cut-off threshold r 0=2 pixels

8.3 M Estimator

An M estimator, like the OLS estimator, is a maximum likelihood estimator [12], but instead of minimizing the sum of squared residuals, it minimizes the sum of a function of the residuals that increases less rapidly with the residuals than their squares do. Consider the objective function:

$$\sum_{i=1}^n\rho(r_i), $$
(8.30)

where ρ(r i ) is a function of r i that increases less rapidly than the square of r i . To minimize this objective function, its partial derivatives with respect to the model parameters are set to 0 and the obtained system of equations is solved. Therefore,

$$\sum_{i=1}^n {{\partial \rho(r_i)}\over{\partial r_i}}{{\partial r_i}\over{\partial a_k}}=0,\quad k=1,\ldots,m.$$
(8.31)

Since ∂r i /∂a k =x ik , and denoting ∂ρ(r i )/∂r i by ψ(r i ), we obtain

$$\sum_{i=1}^n\psi(r_i)x_{ik}=0, \quad k=1,\ldots,m. $$
(8.32)

The residual at the ith observation, \(r_{i}=F_{i}-\sum_{j=1}^{m}x_{ij}a_{j}\), depends on the measurement scale, another unknown parameter. Therefore, rather than solving (8.32), we solve

$$\sum_{i=1}^n\psi\biggl( {r_i\over \sigma}\biggr)x_{ik}=0,\quad k=1,\ldots,m, $$
(8.33)

for the model parameters as well as for the scale parameter σ.

The process of determining the scale parameter and the parameters of the model involves first estimating the initial model parameters by OLS and then estimating the initial scale from the residuals. A robust method to estimate the scale from the residuals is the median absolute deviation [6, 12]:

$$b~ \mathit{med}_i\bigl\{|r_i-M_n|\bigr\}, $$
(8.34)

where M n =med i {r i } for i=1,…,n. To make the estimated scale comparable to the spread σ of a Gaussian distribution representing the residuals, it is required that we let b=1.483.
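A direct implementation of this scale estimate is short; the following sketch (function name illustrative) computes (8.34) with b=1.483.

```python
import numpy as np

def mad_scale(r, b=1.483):
    """Median absolute deviation scale estimate of (8.34)."""
    r = np.asarray(r, dtype=float)
    return b * np.median(np.abs(r - np.median(r)))
```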

Knowing the initial scale, the model parameters are estimated from (8.33) by letting \(r_{i}=F_{i}-\sum_{j=1}^{m}x_{ij}a_{j}\). The process of scale and parameter estimation is repeated until the objective function defined by (8.30) reaches its minimum value.

A piecewise continuous ρ that behaves like a quadratic up to a point, beyond which it behaves linearly, is [11, 12]:

$$\rho(r)= \left\{\begin{array}{l@{\quad}l}r^2/2 & \mbox{if} \ |r|<c,\\[3pt]c|r|-{1\over 2} c^2 & \mbox{if}\ |r|\ge c.\end{array}\right. $$
(8.35)

The gradient of this function is also piecewise continuous:

$$\psi(r)= \left\{ \begin{array}{l@{\quad}l}r & \mbox{if}\ |r|<c,\\[3pt]c \operatorname{sgn} (r) & \mbox{if}\ |r|\ge c.\end{array}\right. $$
(8.36)

ρ(r) and ψ(r) curves, depicted in Fig. 8.2, reduce the effect of distant outliers by switching from quadratic to linear at the threshold distance c. To achieve an asymptotic efficiency of 95%, it is required that we set c=1.345σ when residuals have a normal distribution with spread σ.

Fig. 8.2

(a) The plot of ρ(r) curve of (8.35). (b) The plot of ψ(r) curve of (8.36)
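The following sketch (function names illustrative) implements the ρ and ψ of (8.35) and (8.36). In an iteratively reweighted least-squares solution of (8.33), ψ is commonly converted to weights of the form ψ(r/σ)/(r/σ).

```python
import numpy as np

def huber_rho(r, c):
    """Objective of (8.35): quadratic for |r| < c, linear beyond c."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) < c, r**2 / 2, c * np.abs(r) - c**2 / 2)

def huber_psi(r, c):
    """Influence function of (8.36)."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) < c, r, c * np.sign(r))
```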

The gradient of the objective function, known as the influence function, is a linear function of the residuals or a constant in this example. Therefore, the parameters of the model can be estimated by solving a system of linear equations. Although this M estimator reduces the influence of distant outliers and produces more robust parameters than those obtained by OLS, its breakdown point is also ε*=1/n. This is because the objective function still increases monotonically with increasing residuals, and a single distant outlier can arbitrarily change the estimated parameters.

To further reduce the influence of outliers, consider [28]:

$$\rho(r)= \left\{ \begin{array}{l@{\quad}l}{r^2\over 2}-{r^4\over{2c^2}}+{r^6\over{6c^4}} & \mbox{if}\ |r|\le c,\\[6pt]{c^2\over 6} & \mbox{if}\ |r|>c.\end{array} \right. $$
(8.37)

This ρ(r) is also a piecewise function. It is a function of degree six in r up to distance c, beyond which it becomes a constant, treating all residuals with magnitudes larger than c similarly. This estimator, in effect, prevents distant outliers from arbitrarily changing the estimated parameters. The gradient of ρ(r) is:

$$\psi(r)= \left\{ \begin{array}{l@{\quad}l}r[1-({r\over c})^2 ]^2 & \mbox{if}\ |r|\le c,\\[6pt]0 & \mbox{if}\ |r|> c.\end{array}\right. $$
(8.38)

ρ(r) and ψ(r) curves are plotted in Fig. 8.3. Setting parameter c=4.685σ, 95% asymptotic efficiency is reached when residuals have a normal distribution with spread of σ.

Fig. 8.3

(a) The plot of ρ(r) of (8.37). (b) The plot of ψ(r) of (8.38)
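A corresponding sketch of the redescending influence function (8.38) follows; the function name is illustrative.

```python
import numpy as np

def biweight_psi(r, c):
    """Influence function of (8.38): redescends to 0 for |r| > c."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, r * (1 - (r / c) ** 2) ** 2, 0.0)
```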

Note that the influence function in this example is a nonlinear function of the residuals, requiring the solution of a nonlinear system of equations to estimate the model parameters, which can be very time consuming. The objective function, by assuming a fixed value for residuals larger than a given magnitude, keeps the maximum influence an outlier can have on the estimated parameters under control. In this M estimator, a distant outlier can still adversely affect the estimated parameters, although the effect is not as damaging as with the M estimator whose objective and influence curves are defined by (8.35) and (8.36).

8.4 S Estimator

The scale (S) estimator makes estimation of the scale parameter σ in an M estimator the central problem [28]. An S estimator has the following properties:

  1.

    The ρ curve in the objective function is continuously differentiable and symmetric, and it evaluates to 0 at 0 (i.e., ρ(0)=0).

  2.

    There exists an interval [0,c] (c>0), where ρ is monotonically increasing, and an interval (c,∞), where ρ is a constant.

  3.
    $${E(\rho)\over \rho(c)}=0.5,$$
    (8.39)

    where E(ρ) is the expected value of ρ.

An example of such an estimator is [28]:

$$\rho(r)= \left\{ \begin{array}{l@{\quad}l}{r^2\over 2}-{r^4\over{2c^2}}+{r^6\over{6c^4}} & \mbox{if}\ |r|\le c,\\[3pt]{c^2\over 6} & \mbox{if}\ |r|>c,\end{array} \right.$$
(8.40)

with influence curve

$$\psi(r)= \left\{ \begin{array}{l@{\quad}l}r[1-({r\over c})^2 ]^2 & \mbox{if}\ |r|\le c,\\[3pt]0 & \mbox{if}\ |r|> c.\end{array}\right. $$
(8.41)

The third property is achieved in this example by letting c=1.547 [28].

Given residuals {r i :i=1,…,n} and letting \(\hat{\mathbf{a}}\) be the model parameters estimated by OLS, the scale parameter σ is estimated by solving

$${1\over n}\sum_{i=1}^n \rho\bigl(r_i(\hat{\mathbf{a}})/\hat{\sigma}\bigr)=K,$$
(8.42)

where K is the expected value of ρ. If there is more than one solution, the largest scale is taken as the solution, and if there is no solution, the scale is set to 0 [28]. Knowing scale, \(\bf a\) is estimated, and the process of estimating σ and \(\bf a\) is repeated until dispersion among the residuals reaches a minimum.

A robust method for estimating the initial scale is the median absolute deviation described by (8.34) [6, 12]. An alternative robust estimation of the scale parameter is [25]:

$$1.193~ \mathit{med}_i\bigl\{ \mathit{med}_j\bigl\{|r_i-r_j|\bigr\}\bigr\}.$$
(8.43)

For each r i , the median of {|r i −r j |:j=1,…,n} is determined. By varying i=1,…,n, n numbers are obtained, the median of which is the estimated scale. The factor 1.193 makes the estimated scale consistent with the scale σ of the Gaussian approximating the distribution of the residuals.
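The following sketch (function name illustrative) computes this estimate directly from the O(n²) pairwise differences.

```python
import numpy as np

def pairwise_median_scale(r, c=1.193):
    """Scale estimate of (8.43): the median over i of the median over j
    of |r_i - r_j|."""
    r = np.asarray(r, dtype=float)
    inner = np.median(np.abs(r[:, None] - r[None, :]), axis=1)
    return c * np.median(inner)
```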

If ρ possesses the three properties mentioned above, the breakdown point of the S estimator will be [28]:

$$\varepsilon ^*={1\over n}\biggl( \bigg\lfloor {n\over 2}\bigg\rfloor -m+2 \biggr).$$
(8.44)

As n approaches ∞, the breakdown point of the S estimator approaches 0.5. This high breakdown point is due to the second property of the ρ curve, which requires ρ to have a constant value beyond a certain point. This stops a single outlier from influencing the outcome arbitrarily. Note that although an outlier is less damaging to the S estimator than it could otherwise be, it still adversely affects the estimated parameters; as the number of outliers increases, the estimates worsen up to the breakdown point, beyond which there is a drastic change in the estimated parameters.

To summarize, an S estimator first determines the residuals using OLS or a more robust estimator. Then the scale parameter is estimated from the residuals. Knowing an estimate \(\hat{\sigma}\) of the scale parameter, r i is replaced with \(r_{i}/\hat{\sigma}\) and the influence function is solved for the parameters of the model. Note that this requires the solution of a system of nonlinear equations. Having the estimated model parameters \(\hat{\mathbf{a}}\), the process of finding the residuals, estimating the scale, and estimating the model parameters is repeated until a minimum is reached in the estimated scale, showing minimum dispersion of the obtained residuals.

8.5 RM Estimator

The repeated median (RM) estimator works with the median of the parameters estimated by different combinations of m points out of n [32]. If there are n points and m model parameters, there are overall n!/[m!(n−m)!], or O(n^m), combinations of points that can be used to estimate the model parameters.

Now consider the following median operator:

$$M\bigl\{\tilde{\mathbf{a}}(i_1,\ldots,i_m)\bigr\}= \mathit{med}_{i_m}\bigl\{\tilde{\mathbf{a}}(i_1,\ldots,i_{m-1},i_m)\bigr\},$$
(8.45)

where the right-hand side is the median of the parameters \(\tilde{\mathbf{a}}(i_{1},\ldots,i_{m-1},i_{m})\) as point i m is replaced with all points not already among the m points. Every time the operator is called, it replaces one of its m points with all points not already in use. By calling the operator m times, each time replacing one of its points, the median parameters over all combinations of m points out of n are obtained. The obtained median parameters are taken as the parameters of the model:

$$\hat{\mathbf{a}} = M^m\bigl\{\tilde{\mathbf{a}}(i_1,\ldots,i_m)\bigr\} $$
(8.46)
$$\phantom{\hat{\mathbf{a}}} = \mathit{med}_{i_1}\bigl(\cdots\bigl(\mathit{med}_{i_{m-1}}\bigl(\mathit{med}_{i_m}\,\tilde{\mathbf{a}}(i_1,\ldots,i_m)\bigr)\bigr)\cdots\bigr). $$
(8.47)

The process of estimating the model parameters can be considered m nested loops, where each loop goes through the n points except for the ones already in use by the outer loops and determines the parameters of the model for each combination of m points. The median of each parameter is used as the best estimate of that parameter.
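The following sketch (function and variable names illustrative) implements this nested-median computation for one component of an affine transformation (m=3), solving a 3×3 system for every combination of three points; it is O(n³) and intended only for small n.

```python
import numpy as np

def repeated_median_affine(xy, F):
    """Repeated median estimate (8.47) of one affine component
    f = a1*x + a2*y + a3, using every combination of three distinct points."""
    n = len(F)
    x, y = xy[:, 0], xy[:, 1]

    def triple_params(i, j, k):
        M = np.array([[x[i], y[i], 1.0],
                      [x[j], y[j], 1.0],
                      [x[k], y[k], 1.0]])
        try:
            return np.linalg.solve(M, np.array([F[i], F[j], F[k]]))
        except np.linalg.LinAlgError:          # skip collinear triples
            return None

    outer = []
    for i in range(n):
        middle = []
        for j in range(n):
            if j == i:
                continue
            inner = [p for k in range(n) if k not in (i, j)
                     for p in (triple_params(i, j, k),) if p is not None]
            if inner:
                middle.append(np.median(inner, axis=0))   # median over i_3
        outer.append(np.median(middle, axis=0))           # median over i_2
    return np.median(outer, axis=0)                       # median over i_1: (a1, a2, a3)
```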

When n is very large, an exhaustive search for the optimal parameters becomes prohibitively time consuming, especially when m is also large. To reduce the computation time without significantly affecting the outcome, only point combinations that are sufficiently far from each other in the (x,y) domain are used. Points distant from each other produce more accurate parameters, as they are less influenced by small positional errors. For instance, the points on the convex hull of the point set can be used. By discarding the points inside the convex hull, considerable savings can be achieved.

Evaluating the RM estimator on the data sets in Table 8.1, using all combinations of 3 correspondences out of the marked correspondences in the table, produces the parameters listed in Table 8.6. The RMSE measures and the computation time required to find the parameters for each set are also shown.

Table 8.6 The parameters estimated by the RM estimator along with RMSE measures and computation time for the five data sets in Table 8.1

The results obtained by the fast version of the RM estimator, which uses only the convex-hull points in the reference image and the corresponding points in the sensed image, are shown in Table 8.7. The fast RM estimator achieves a speed-up factor of more than 1000 while introducing only small errors into the estimated parameters. The difference between the two is expected to shrink further with increasing n.

Table 8.7 Results obtained by the fast version of the RM estimator using only the convex-hull points in parameter estimation

Although the RM estimator has a theoretical breakdown point of 0.5, in the scaled data set only 28 of the 78 marked correspondences in Table 8.1 are correct, so more than half of the correspondences are incorrect. However, since all residuals are within 10 pixels, the RM estimator was still able to estimate the parameters of the model.

8.6 LMS Estimator

The least median of squares (LMS) estimator finds the model parameters by minimizing the median of squared residuals [24]:

$$\min_{\hat{\mathbf{a}}}\bigl\{ \mathit{med}_i\bigl(r_i^2\bigr)\bigr\}.$$
(8.48)

When the residuals have a normal distribution with a mean of zero and when two or more parameters are to be estimated (m≥2), the breakdown point of the LMS estimator is [24]:

$$\varepsilon ^*={1\over n}\biggl( \bigg\lfloor {n\over 2}\bigg\rfloor -m+2 \biggr).$$
(8.49)

As n approaches ∞, the breakdown point of the estimator approaches 0.5.

By minimizing the median of squares, the process, in effect, minimizes the sum of squares of the smallest ⌊n/2⌋ absolute residuals. Therefore, first, the parameters of the model are estimated by OLS or a more robust estimator. Then, points that produce the ⌊n/2⌋ smallest magnitude residuals are identified and used in OLS to estimate the parameters of the model. The process is repeated until the median of squared residuals reaches a minimum.
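The following sketch (names illustrative) implements the iterative scheme just described: fit, keep the ⌊n/2⌋ points with the smallest squared residuals, refit, and stop when the median squared residual no longer decreases. It is a sketch of the procedure, not the exact implementation used to produce Table 8.8.

```python
import numpy as np

def fit_lms(A, F, max_iter=50):
    """Least median of squares via repeated refitting on the half of the
    points with the smallest squared residuals."""
    n = len(F)
    a = np.linalg.pinv(A) @ F                    # initial fit by OLS
    best = np.median((F - A @ a) ** 2)
    for _ in range(max_iter):
        keep = np.argsort((F - A @ a) ** 2)[: n // 2]
        a_new = np.linalg.pinv(A[keep]) @ F[keep]
        med = np.median((F - A @ a_new) ** 2)
        if med >= best:                          # median of squares stopped decreasing
            break
        a, best = a_new, med
    return a
```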

Using the data sets shown in Table 8.1, the results in Table 8.8 are obtained. The process in each case takes from a few to several iterations to find the parameters. The LMS estimator has been able to find parameters between the transformed images and the original image that are as close to the ideal parameters as the parameters estimated by any of the estimators discussed so far.

Table 8.8 The parameters estimated by the LMS estimator using the data sets in Table 8.1

8.7 LTS Estimator

The least trimmed squares (LTS) estimator [26] is similar to the LMS estimator except that it uses fewer than half of the smallest squared residuals to estimate the parameters. LTS estimates the parameters by minimizing

$$\sum_{i=1}^h \bigl(r^2\bigr)_{i:n},$$
(8.50)

where m ≤ h ≤ n/2+1 and (r 2) i:n ≤ (r 2) j:n when i<j. The process initially estimates the parameters of the model by OLS or a more robust estimator. It then orders the residuals and identifies the points that produce the h smallest residuals. Those points are then used to estimate the parameters of the model. The squared residuals are recalculated using all points and ordered. The process of selecting points and calculating and ordering the residuals is repeated. The parameters obtained from the points producing the h smallest residuals are taken as estimates of the model parameters in each iteration. The process is stopped when the hth smallest squared residual reaches a minimum.
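A sketch of this procedure follows; the function name and the default choice of h are illustrative.

```python
import numpy as np

def fit_lts(A, F, h=None, max_iter=50):
    """Least trimmed squares: refit on the h smallest squared residuals
    until the h-th smallest squared residual no longer decreases."""
    n = len(F)
    if h is None:
        h = n // 2 + 1                           # default; the experiments below use h = n/4
    a = np.linalg.pinv(A) @ F                    # initial fit by OLS
    best = np.inf
    for _ in range(max_iter):
        keep = np.argsort((F - A @ a) ** 2)[:h]
        a_new = np.linalg.pinv(A[keep]) @ F[keep]
        hth = np.sort((F - A @ a_new) ** 2)[h - 1]
        if hth >= best:
            break
        a, best = a_new, hth
    return a
```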

The breakdown point of the LTS estimator is [26]:

$$\varepsilon ^*= \left\{ \begin{array}{l@{\quad}l}(h-m+1)/n & \mathrm{if}\ m\le h <\lfloor {{n+m+1}\over 2}\rfloor,\\[6pt](n-h+1)/n & \mathrm{if}\ \lfloor {{n+m+1}\over 2}\rfloor\le h \le n.\end{array}\right.$$
(8.51)

When n is not very large and the number of parameters m is small, letting h=n/2+1 gives a breakdown point close to 0.5. When n is very large, letting h=n/2 gives a breakdown point close to 0.5 irrespective of m. Note that due to the need to order the residuals in the objective function, each iteration of the algorithm requires O(n log n) comparisons.

By letting h=n/4 and using the data in Table 8.1, we obtain the results shown in Table 8.9. The results are similar to those obtained by the LMS estimator when using all the correspondences. When using only the correct correspondences, the results obtained by the LTS estimator are slightly better than those obtained by the LMS estimator.

Table 8.9 Parameters estimated by the LTS estimator with h=n/4 using the data sets in Table 8.1

When the ratio of correct correspondences to all correspondences falls below 0.5, the parameters initially estimated by OLS may not be accurate enough to produce squared residuals that, when ordered, place the correct correspondences before the incorrect ones. Therefore, the ordered list may contain a mixture of correct and incorrect correspondences from the very start. When the majority of correspondences are correct and there are no distant outliers, the residuals are ordered such that more correct correspondences appear at and near the beginning of the list. This enables points with smaller squared residuals to be selected, allowing more correct correspondences to participate in the estimation process and ultimately producing more accurate results.

8.8 R Estimator

A rank (R) estimator ranks the residuals and uses the ranks to estimate the model parameters [13]. By using the ranks of the residuals rather than their actual values, the influence of very distant outliers is reduced. By assigning weights to the residuals through a scoring function, the breakdown point of the estimator can be increased up to 0.5. Using only a fraction α of the residuals in estimating the parameters of the model, Hössjer [9] reduced the influence of the fraction 1−α of largest-magnitude residuals on the parameter estimation. It is shown that a breakdown point of 0.5 can be achieved by letting α=0.5.

If R i is the rank of the magnitude of the ith residual |r i | among the n residuals and b n (R i ) is the score assigned to that residual by a score-generating function, then the objective function to minimize is

$${1\over n} \sum_{i=1}^n b_n(R_i)r_i^2, $$
(8.52)

which can be achieved by setting its gradient to zero and solving the obtained system of linear equations. Therefore,

$$\sum_{i=1}^nb_n(R_i)r_i x_{ik}=0,\quad k=1,\ldots,m.$$
(8.53)

This is, in effect, a WLS estimator where the weight of the residual at the ith point is b n (R i ).

Given ranks {R i :i=1,…,n}, an example of a score generating function is

$$b_n(R_i)=h\bigl(R_i/(n+1)\bigr), $$
(8.54)

which maps the ranks to (0,1) in such a way that

$$\mathrm{sup}\bigl\{u; h(u)>\alpha \bigr\}=\alpha,\quad 0<\alpha \le 1. $$
(8.55)

For example, if α=0.25 and letting u=R i /(n+1), then when R i /(n+1)≤α the score is u, and when R i /(n+1)>α the score is 0.25. This scoring function, in effect, assigns a fixed score to a certain percentage of the highest-magnitude residuals. Therefore, when α=0.25, the highest 75% of the residuals are given a fixed weight that is lower than what they would otherwise receive. The scoring function can also be designed to assign decreasing scores to increasing residuals beyond a point and to assign a score of 0 to a percentage of the largest-magnitude residuals. For example, consider the scoring function depicted in Fig. 8.4 with 0<α≤β≤γ≤1,

$$b_n(R_i)=\left\{\begin{array}{l@{\quad}l}R_i/(n+1), & \mathrm{if}\ R_i/(n+1) \le \alpha,\\[3pt]\alpha, &\mathrm{if}\ \alpha < R_i/(n+1)\le \beta,\\[3pt]\alpha[\gamma-R_i/(n+1)]/(\gamma-\beta), & \mathrm{if}\ \beta < R_i/(n+1) \le \gamma,\\[3pt]0, & \mathrm{if}\ R_i/(n+1) > \gamma.\end{array}\right. $$
(8.56)
Fig. 8.4

Plot of the scoring function of (8.56)

This scoring function discards the 100(1−γ) percent of the points that produce the largest-magnitude residuals. By discarding such points, the process removes the outliers. Hössjer [9] has shown that if the scoring function is nondecreasing, the process has a single global minimum. However, if the scoring function decreases in an interval, there may be more than one minimum, and if the initial parameters estimated by OLS are not near the final parameters, the R estimator may converge to a local minimum rather than the global one.

To summarize, estimation by an R estimator involves the following steps (a code sketch follows the list):

  1.

    Design a scoring function.

  2.

    Estimate the model parameters by OLS or a more robust estimator and calculate the residuals.

  3.

    Let initial weights at all points be 1/n.

  4.

    Rank the points according to the magnitude of the weighted residuals.

  5.

    Find the score at each point using the scoring function, and let the score represent the weight at the point.

  6.

    Find the model parameters by the WLS estimator.

  7.

    Estimate the new residuals at the points. If a minimum is reached in the sum of weighted squared residuals, stop. Otherwise, go to Step 4.
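The following sketch (names illustrative) follows Steps 1–7 with the scoring function of (8.56); it is not the exact implementation behind Tables 8.10 and 8.11, and it assumes that ranks increase with residual magnitude, so that the largest-magnitude residuals are the ones given a score of 0 when γ<1.

```python
import numpy as np

def score(u, alpha=0.5, beta=0.75, gamma=1.0):
    """Scoring function of (8.56); u = R_i/(n+1)."""
    if u <= alpha:
        return u
    if u <= beta:
        return alpha
    if u <= gamma:
        return alpha * (gamma - u) / (gamma - beta)
    return 0.0

def fit_r_estimator(A, F, alpha=0.5, beta=0.75, gamma=1.0,
                    max_iter=50, tol=1e-8):
    """R estimator following Steps 1-7 above (an illustrative sketch)."""
    n = len(F)
    a = np.linalg.pinv(A) @ F                      # Step 2: initial fit by OLS
    w = np.full(n, 1.0 / n)                        # Step 3: initial weights
    prev = np.inf
    for _ in range(max_iter):
        r = F - A @ a
        order = np.argsort(np.abs(w * r))          # Step 4: rank weighted residuals
        ranks = np.empty(n)
        ranks[order] = np.arange(1, n + 1)
        w = np.array([score(R / (n + 1), alpha, beta, gamma)
                      for R in ranks])             # Step 5: scores become weights
        sw = np.sqrt(w)                            # Step 6: WLS minimizing (8.52)
        a = np.linalg.lstsq(A * sw[:, None], F * sw, rcond=None)[0]
        cost = np.sum(w * (F - A @ a) ** 2)        # Step 7: weighted squared residuals
        if prev - cost < tol:
            break
        prev = cost
    return a
```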

Using the nondecreasing scoring function in (8.54), the results shown in Table 8.10 are obtained for the data sets in Table 8.1. Using the scoring function (8.56) with α=0.5, β=0.75, and γ=1.0, the results shown in Table 8.11 are obtained for the same data sets.

Table 8.10 Parameter estimation by the R estimator when using the scoring function of (8.54) with α=0.5 and the data sets in Table 8.1
Table 8.11 Parameter estimation by the R estimator when using the scoring function of (8.56) with α=0.5, β=0.75, and γ=1.0 and the data sets in Table 8.1

Similar results are obtained by the two scoring functions. Comparing these results with those obtained by previous estimators, we see that the results by the R estimator are not as good as those obtained by some of the other estimators when using the data sets in Table 8.1. By using the ranks of the residuals rather than their magnitudes, the process reduces the influence of distant outliers. The process, however, may assign large ranks to very small residuals in cases where a great portion of the residuals are very small. This, in effect, degrades the estimation accuracy. Therefore, in the absence of distant outliers, as is the case for the data sets in Table 8.1, the R estimator does not produce results as accurate as those obtained by LMS and LTS estimators.

8.9 Effect of Distant Outliers on Estimation

If a correspondence algorithm does not have the ability to distinguish inaccurate correspondences from incorrect ones, some incorrect correspondences (outliers) may take part in estimation of the model parameters. In such a situation, the results produced by different estimators will be different from the results presented so far. To get an idea of the kind of results one may get from the various estimators in the presence of distant outliers, the following experiment is carried out.

The 28 correct corresponding points in the original and scaled images marked with ‘+’ in Table 8.1 are taken. These correspondences are connected with yellow lines in Fig. 8.5a. In this correspondence set, points in the original set are kept fixed and points in the scaled set are switched one at a time until the breakdown point for each estimator is reached. To ensure that the outliers are far from the estimating model, the farthest points in the scaled set are switched. The correct correspondences, along with the outliers tested in this experiment, are shown in Figs. 8.5b–i. Red lines connect the incorrect correspondences and yellow lines connect the correct correspondences. Using point correspondences connected with yellow and red lines, the results shown in Table 8.12 are obtained by the various estimators. The green lines indicate the correct correspondences that are not used in the estimation process.

Fig. 8.5

(a) 28 corresponding points in the coin image and its scaled version. (b)–(i) Introduction of 1, 2, 4, 6, 8, 9, 10, and 14 outliers into the correspondence set of (a). Red lines are the outliers (false positives) and green lines are the missed correspondences (false negatives). The yellow lines are the correct correspondences (true positives). The points connected with the yellow and red lines are used as corresponding points in the experiments

Table 8.12 Breakdown points for various estimators in the presence of distant outliers. Table entries show RMSE at the correct correspondences. The point at which a sharp increase in RMSE is observed while gradually increasing the number of outliers is the breakdown point. WLS-1 and WLS-2 denote WLS estimation without and with a cut-off threshold of 2 pixels, RM-1 and RM-2 denote the regular and the fast RM estimators, and R-1 and R-2 denote the R estimator with the non-decreasing scoring function of (8.54) with α=0.5 and the decreasing scoring function of (8.56) with α=0.25, β=0.5, and γ=0.75, respectively. The numbers in the top row show the number of distant outliers used in a set of 28 corresponding points

From the results in Table 8.12, we can conclude the following:

  1.

    For the data set in Fig. 8.5a, where no outliers are present and the data are simply corrupted with random noise, OLS performs as well as any other estimator by finding the maximum likelihood estimate of the parameters.

  2.

    Because OLS can break down with a single distant outlier, the estimators that depend on OLS to find the initial residuals or initial parameters can also break down with a single distant outlier. The WLS and R-1 estimators exhibited this behavior on the data sets containing one or more outliers.

  3.

    To improve the accuracy of the estimators, a means is required either to eliminate some of the distant outliers, as done by R-2, or to estimate the initial model parameters more robustly.

  4.

    When using the data sets in Fig. 8.5, the clear winner is the R-2 estimator, which uses the scoring function in (8.56). By effectively removing some of the outliers, ordering the rest, and using points with low squared residuals, this estimator was able to find correct model parameters from data containing up to 50% distant outliers (Fig. 8.5i). LTS with h=n/4 and LMS also performed well in the presence of distant outliers.

8.10 Additional Observations

For the data sets in Table 8.1, all tested estimators were able to find the parameters of the affine transformation to register the images with acceptable accuracy. These data sets do not contain distant outliers, and the errors at the points have distributions that are close to normal with a mean of 0. Among the estimators tested, the RM, LMS, and LTS estimators produce the highest accuracies. Considering the high computational requirement of the RM estimator, LMS and LTS stand out in overall speed and accuracy when estimating model parameters from data sets of the kind shown in Table 8.1.

For data sets of the kind depicted in Fig. 8.5, where distant outliers are present, the results in Table 8.12 show that the R estimator with the scoring function given in (8.56) is the most robust among the estimators tested, followed by the LTS and LMS estimators. The OLS and WLS estimators should not be used when the provided data contain distant outliers.

Although some estimators performed better than others in the limited tests carried out in this chapter, it should be mentioned that for any of the estimators one may be able to find a data set on which it outperforms many of the others. When the data sets represent coordinates of corresponding points obtained by a point-pattern-matching algorithm, it is anticipated that the R-2 estimator will perform better than the others when distant outliers are present, and that the LTS and LMS estimators will perform better than the other estimators when the correspondences do not contain distant outliers.

Methods have been developed to remove outliers when the ratio of outliers to inliers is small and the outliers are distant from the model; Hodge and Austin [8] provide a survey of such methods. Outlier detection without information about the underlying model, however, is not always possible, especially when the number of outliers is nearly the same as the number of inliers or when the outliers are not very far from the model to be estimated. Robust estimators, coupled with the geometric constraints that hold between images of a scene, can determine the model parameters in the presence of a large number of outliers without the use of outlier detection methods.

The list of estimators discussed in this chapter is by no means exhaustive. For a more complete list of estimators, the reader is referred to excellent monographs by Andrews et al. [3], Huber [12], Hampel et al. [7], Rousseeuw and Leroy [27], and Wilcox [36].

8.11 Principal Component Analysis (PCA)

Suppose feature vector x={x 0,x 1,…,x N−1} represents an observation from a phenomenon and there are m such observations: {x i:i=0,…,m−1}. We would like to determine an N×N matrix A that can transform x to a new feature vector y=A t x that has a small number of high-valued components. Such a transformation makes it possible to reduce the dimensionality of x while maintaining its overall variation.

Assuming each feature is normalized to have a mean of 0 and a fixed scale, such as 1, the expected value of yy t can be computed from

$$E\bigl(\mathbf{y}\mathbf{y}^t\bigr) = E\bigl(\mathbf{A}^t\mathbf{x}\mathbf{x}^t\mathbf{A}\bigr) = \mathbf{A}^t E\bigl(\mathbf{x}\mathbf{x}^t\bigr)\mathbf{A} = \mathbf{A}^t\varSigma _x\mathbf{A} $$
(8.57)

where

$$\varSigma _x = \left[\begin{array}{c@{\quad}c@{\quad}c@{\quad}c}E(x_0x_0) & E(x_0x_1) & \dots & E(x_0x_{N-1}) \\E(x_1x_0) & E(x_1x_1) & \dots & E(x_1x_{N-1}) \\\cdot & \cdot & \dots & \cdot \\E(x_{N-1}x_0) & E(x_{N-1}x_1) & \dots & E(x_{N-1}x_{N-1}) \\\end{array} \right]$$
(8.58)

is the covariance matrix with its ijth entry computed from

$$E(x_ix_j)={1\over m} \sum_{k=0}^{m-1}\bigl(x_i^kx_j^k\bigr).$$
(8.59)

By letting the eigenvectors of Σ x represent the columns of A, A t Σ x A will become a diagonal matrix with diagonal entries showing the eigenvalues of Σ x .

Suppose the eigenvalues of Σ x are ordered so that λ i ≥λ i+1 for 0≤i<N−1 and the eigenvectors corresponding to the eigenvalues are v 0,v 1,…,v N−1. We can then write

$$y_i=\mathbf{v}_i^t\mathbf{x},\quad i=0,\ldots,N-1.$$
(8.60)

If transformed features are known, the original features can be computed from

$$\mathbf{x}=\sum_{i=0}^{N-1}y_i\mathbf{v}_i.$$
(8.61)

An approximation to x using eigenvectors of Σ x corresponding to its n largest eigenvalues is obtained from

$$\hat{\mathbf{x}}=\sum_{i=0}^{n-1}y_i\mathbf{v}_i.$$
(8.62)

Squared error in this approximation will be [23, 34]

$$E\bigl(\|\mathbf{x}-\hat{\mathbf{x}}\|^2\bigr) = \sum_{i=n}^{N-1}\mathbf{v}_i^t\lambda_i\mathbf{v}_i = \sum_{i=n}^{N-1}\lambda_i $$
(8.63)

for using y 0,y 1,…,y n−1 instead of x 0,x 1,…,x N−1.

Since the eigenvalues depend on the scale of features, the ratio measure [23]

$$r_n=\sum_{i=n}^{N-1}\lambda_i \Big/\sum_{i=0}^{N-1}\lambda_i$$
(8.64)

may be used as a scale-independent error measure to select the number of principal components needed to achieve a required squared error tolerance in approximation.

To summarize, the following are the steps to reduce the dimensionality of feature vector x from N to n<N using a training data set containing m observations (a code sketch follows the list):

  1.

    Estimate Σ x from the m observations.

  2.

    Find the eigenvalues and eigenvectors of Σ x . Order the eigenvalues from the largest to the smallest: λ 0≥λ 1≥⋯≥λ N−1. Note that eigenvalue λ i has an associated eigenvector, v i .

  3.

    Find the smallest n such that \(\sum_{i=n}^{N-1}\lambda_{i}<\varepsilon \), where ε is the required squared error tolerance.

  4.

    Given a newly observed feature vector x, project x onto the n dimensions defined by the eigenvectors corresponding to the n largest eigenvalues of Σ x . That is, compute \(y_{i}=\mathbf{v}_{i}^{t} \mathbf{x}\) for i=0,…,n−1. \(\bf y\) represents a point in n<N dimensions, thereby reducing the dimensionality of x while ensuring the squared approximation error stays below the required tolerance.
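The following sketch (function name illustrative) carries out Steps 1–4 with NumPy, assuming the observations are supplied as rows of an m×N array and that the features are normalized by subtracting their means.

```python
import numpy as np

def pca_reduce(X, eps):
    """Project observations onto the leading principal components,
    keeping the squared approximation error (8.63) below eps.

    X   : (m, N) array, one feature vector per row
    eps : tolerance on the sum of the discarded eigenvalues."""
    Xc = X - X.mean(axis=0)                    # features normalized to zero mean
    Sigma = Xc.T @ Xc / len(X)                 # Step 1: covariance matrix, (8.58)-(8.59)
    lam, V = np.linalg.eigh(Sigma)             # Step 2: eigh returns ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]             #         reorder so lambda_0 >= lambda_1 >= ...
    tail = np.cumsum(lam[::-1])[::-1]          # tail[n] = sum_{i=n}^{N-1} lambda_i
    below = tail < eps
    n = int(np.argmax(below)) if below.any() else len(lam)   # Step 3
    Y = Xc @ V[:, :n]                          # Step 4: y_i = v_i^t x for each observation
    return Y, V[:, :n], lam
```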

PCA was first used by Pearson [21] to find the best-fit line or plane to high dimensional points. The best-fit line or plane was found to show the direction of most uncorrelated variation. Therefore, PCA transforms correlated values into uncorrelated values, called principal components. The components represent the direction of most uncorrelated variation, the direction of second most uncorrelated variation, and so on.

PCA is also called the Karhunen–Loève (K–L) transform and the Hotelling transform. Given a feature vector containing N features, Hotelling [10] (see also [16, 17]) sought to create n<N new features that carry about the same variance from linear combinations of the original features. He found the linear coefficients relating the original features to the new ones in such a way that the first new feature had the largest variance. The second feature was then created in such a way that it was uncorrelated with the first and had as large a variance as possible. He continued the process until n new features were created. The coefficients of the linear functions defining a new feature in terms of the original features transform the original features to the new ones.

Rao [23] provided various insights into the uses and extensions of PCA. Watanabe [35] showed that dimensionality reduction by PCA minimizes average classification error when taking only a finite number of coefficients in a series expansion of a feature vector in terms of orthogonal basis vectors. He also showed that PCA minimizes the entropy of average square coefficients of the principal components. These two characteristics make PCA a very efficient tool for data reduction. The dimensionality reduction power of PCA using artificial and real data has been demonstrated by Kittler and Young [18]. For a thorough treatment of PCA and its various applications, see the excellent monograph by Jolliffe [16].

Since PCA calculates a new feature using all original features, it still requires high-dimensional data collection. It would be desirable to reduce the number of original features while preserving sufficient variance in collected features without changing the number of principal components. Jolliffe [14, 15] suggested discarding features that contributed greatly to the last few principal components, or selecting features that contributed greatly to the first few principal components. Therefore, if

$$\mathbf{y}=\mathbf{A}^t\mathbf{x},$$
(8.65)

or

$$y_i=\sum_{j=0}^{N-1} A_{ji}x_j, \quad i=0,\ldots,N-1,$$
(8.66)

where A ji denotes the entry at column i and row j in matrix A, then the magnitude of A ji determines the contribution of x j to y i .

Since this method finds ineffective features in the original set by examining the principal components one at a time, the influence of an original feature on several principal components is not taken into consideration. Mao [19] suggested measuring the contribution of an original feature to all selected principal components. The significance of an original feature to the selected n principal components is determined by calculating the squared error in (8.63) once using all features and again using all features except the feature under consideration. The feature producing the least increase in error is then removed from the original set, and the process is repeated until the squared error among the remaining features reaches a desired tolerance.
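The following sketch illustrates this backward-elimination idea; the function name, the stopping rule (keep a fixed number of features), and the use of the discarded-eigenvalue sum of (8.63) as the error measure are illustrative assumptions, and Mao's original criterion may differ in detail.

```python
import numpy as np

def backward_select(X, n, n_keep):
    """Repeatedly drop the original feature whose removal increases the
    discarded variance of the first n principal components the least."""
    def tail_error(cols):
        Xc = X[:, cols] - X[:, cols].mean(axis=0)
        lam = np.linalg.eigvalsh(Xc.T @ Xc / len(X))[::-1]   # descending eigenvalues
        return lam[n:].sum()                                 # discarded variance, as in (8.63)
    cols = list(range(X.shape[1]))
    while len(cols) > n_keep:
        errs = [tail_error([c for c in cols if c != d]) for d in cols]
        cols.pop(int(np.argmin(errs)))                       # least increase in error
    return cols
```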

Since each transformed feature in PCA is a linear combination of the original features, the process detects only linear dependency between features. If dependency between features is nonlinear, nonlinear approaches [29, 30, 33] should be used to reduce the number of features.