This chapter first presents Lie operators for the detection of key points, working in the affine plane. This approach is motivated by evidence from the human visual system; these Lie filters therefore appear to be very useful for implementing, in the near future, a humanoid vision system.

The second part of the chapter presents an application of the quaternion Fourier transform to preprocessing for neurocomputing. In a novel way, the 1D acoustic signals of spoken French words are represented as 2D signals in the frequency and time domains. These images are then convolved in the quaternion Fourier domain with a quaternion Gabor filter for the extraction of features. This approach allows us to greatly reduce the dimension of the feature vector. Two methods of feature extraction are tested. The feature vectors were used for the training of a simple MLP, a TDNN, and a system of neural experts. The improvement in the classification rate of the neural network classifiers is very encouraging, which amply justifies the preprocessing in the quaternion frequency domain. This work also suggests the application of the quaternion Fourier transform to other image processing tasks.

The third part of the chapter presents the theory and practicalities of the quaternion wavelet transform. This work generalizes the real and complex wavelet transforms and derives a quaternionic wavelet pyramid for multiresolution analysis using the quaternionic phase concept. As an illustration, we present an application of the discrete QWT for optical flow estimation. For the estimation of motion through different resolution levels, we use a similarity distance evaluated by means of the quaternionic phase concept and a confidence mask.

1 Lie Filters in the Affine Plane

This section carries out the computations in the affine plane \({\mathcal{A}}_{{e}_{3}}({\mathcal{N}}^{2})\) for image analysis. We utilize the Lie algebra of the affine plane explained in Chap. 5 for the design of image filters to detect visual invariants. As an illustration, we apply these filters for the recognition of hand gestures.

1.1 The Design of an Image Filter

In the experiment, we used simulated images of the optical flow for two motions, a rotational and a translational motion (see Fig. 13.1a), and a dilation and a translational motion (see Fig. 13.2a). The experiment uses only bivector computations to determine the type of motion, the axis of rotation, and/or the center of the dilation.

Fig. 13.1

Detection of visual invariants: (a) rotation (L r ) and translational flow (L x ) fields, (b) convolving via geometric product with a Gaussian kernel, (c) magnitudes of the convolution

Fig. 13.2

Detection of visual invariants: (a) expansion (L s ) and translational flow (L x ) fields, (b) convolving via the geometric product with a Gaussian kernel, (c) magnitudes of the convolution

To study the motions in the affine plane, we used the Lie algebra of bivectors of the neutral affine plane \({\mathcal{A}}_{{e}_{3}}({\mathcal{N}}^{2})\); see Sect. 5.6. The computations were carried out with the help of a computer program that we wrote in C++. Each flow vector at a point of the image was coded as \(\mathbf{x} = x{e}_{1} + y{e}_{2} + {e}_{3} \in {\mathcal{N}}^{3}\). At each point of the flow image, we applied the commutator product with the six bivectors of Eq. 5.86. From the resultant coefficients of the vectors, the computer program determined which type of differential invariant or motion was present.

Figure 13.1b shows the result of convolving, via the geometric product, the bivector field with a Gaussian kernel of size 5 × 5, and Fig. 13.1c presents the magnitudes of the kernel output; the white center of the image indicates the lowest magnitude. Figure 13.2 shows the results for the case of an expanding flow. Comparing Fig. 13.1c with Fig. 13.2c, we note the duality of the differential invariants: the invariant of the rotation is its center point, whereas the invariant of the expansion is a line. A sketch of the smoothing step follows.
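To make the filtering step concrete, here is a minimal numpy/scipy sketch of the smoothing-and-magnitude stage, assuming the flow field is stored componentwise as two arrays (the geometric-product bookkeeping of the bivector coefficients from Eq. 5.86 is omitted; the synthetic rotational field and the 64 × 64 size are illustrative choices, not the original data):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Hypothetical flow field: at each pixel the flow vector (vx, vy).
# For a pure rotation about the image center, vx = -y', vy = x'.
h, w = 64, 64
ys, xs = np.mgrid[0:h, 0:w].astype(float)
vx, vy = -(ys - h / 2), (xs - w / 2)

# Smooth each component with a small Gaussian kernel (the text uses a
# 5 x 5 mask) and take the magnitude of the smoothed flow.
sx = gaussian_filter(vx, sigma=1.0)
sy = gaussian_filter(vy, sigma=1.0)
magnitude = np.hypot(sx, sy)

# The minimum of the magnitude marks the differential invariant:
# the center point for a rotation, a line for an expansion (cf. the text).
i, j = np.unravel_index(np.argmin(magnitude), magnitude.shape)
print(f"lowest magnitude near pixel ({i}, {j})")
```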

1.2 Recognition of Hand Gestures

Another interesting application, suggested by the seminal paper of Hoffman [98], is to recognize a gesture using the key points of an image along with the previous Lie operators arranged in a detection structure, as depicted in Fig. 13.3. These Lie filters may be seen as perceptrons, which play an important role in image preprocessing in the human visual system. It is believed [98] that during the first years of human life, some kinds of Lie operators combine to build the higher-dimensional Lie algebra of SO(4,1).

Fig. 13.3

Lie perceptron arrangement for feature detection

In this sense, we assume that the outputs of the Lie operators are linearly combined with an outstar output according to the following equation:

$$\begin{array}{rcl} O_{\alpha}(x,y) & = & w_1\,\mathcal{L}_x(x,y) + w_2\,\mathcal{L}_y(x,y) + w_3\,\mathcal{L}_r(x,y) \\ & & +\, w_4\,\mathcal{L}_s(x,y) + w_5\,\mathcal{L}_b(x,y) + w_6\,\mathcal{L}_B(x,y), \end{array}$$
(13.1)

where the weights \(w_i\) can be adjusted by applying a supervised training procedure. If the desired feature or key point α at the point (x, y) is detected, the output \(O_{\alpha}(x,y)\) goes to zero.

Figure 13.4a shows hand gestures given to a robot. By applying the Lie perceptron arrangement to the hand region (see Fig. 13.4b), a robot can detect whether it should follow, stop, or move in circles. Table 13.1 presents the weights \(w_i\) necessary for detecting the three gestures. Detection with a given tolerance is carried out as follows:

$$O_{\alpha}(x,y) \,\leq\, \min + \frac{(\max - \min)\,\mathrm{Tolerance}}{100} \;\rightarrow\; \mbox{detection of a feature type},$$

where min and max correspond to the minimal and maximal Lie operator outputs, respectively.
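A compact sketch of Eq. 13.1 together with this detection rule, assuming the six operator responses are available as arrays (the function name, shapes, and the random example data are hypothetical):

```python
import numpy as np

def lie_perceptron(outputs, weights, tolerance=5.0):
    """Weighted combination of the six Lie operator outputs (Eq. 13.1)
    plus the tolerance-based detection rule given above.

    outputs  : array (6, H, W) holding L_x, L_y, L_r, L_s, L_b, L_B
    weights  : array (6,) of trained weights w_i
    tolerance: detection tolerance in percent
    """
    O = np.tensordot(weights, outputs, axes=1)          # O_alpha, shape (H, W)
    lo, hi = O.min(), O.max()
    detected = O <= lo + (hi - lo) * tolerance / 100.0  # detection rule
    return O, detected

# Hypothetical usage with random operator responses and unit weights:
responses = np.random.rand(6, 32, 32)
O_alpha, mask = lie_perceptron(responses, np.ones(6))
```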

Fig. 13.4

Gesture detection: (a) (top images) gestures for robot guidance (follow, stop, and explore), and (b) (lower images) detected gestures by the robot vision system using Lie operators

Table 13.1 Weights of the Lie perceptron arrangement for the detection of hand gestures

2 Representation of Speech as 2D Signals

In this work, we use the psycho-acoustical model of a loudness meter suggested by Zwicker [199]. This meter model is depicted in Fig. 13.5a; its output is a 2D representation of sound loudness over time and frequency. The motivation of this work is to use the loudness image in order to take advantage of the variation in time of the frequency components of the sound. A brief explanation of the meter model follows.

Fig. 13.5

From the top: (a) the psycho-acoustical model of the loudness meter suggested by Zwicker, (b) 2D representation (vertical axis: outputs of the 20 filters; horizontal axis: time), (c) a 3D energy representation of (b), where the energy levels are represented along the z-coordinate, (d) main loudness signal, i.e., the total output of the psycho-acoustical model (the sum of the 20 channels)

The sound pressure is picked up by a microphone and converted to an electrical signal, which in turn is amplified. Thereafter, the signal is attenuated to produce the same loudness in a diffuse and free-sound field. In order to take into account the frequency dependence of the sound coming from the exterior and passing through the outer ear, a transmission factor is utilized. The signal is then filtered by a filter bank with filter bands dependent on the critical band rate (Barks). In this work, we have taken only the first 20 Barks, because this study is restricted to the band of speech signals. At the output of the filters, the energy of each filter signal is calculated to obtain the maximal critical band level varying with time. Having 20 of these outputs, a 2D sound representation can be formed as presented in Fig. 13.5b. At each filter output, the loudness is computed taking into account temporal and frequency effects according to the following equation:

$$N' = 0.068\left(\frac{E_{TQ}}{s\cdot E_0}\right)^{0.25}\left[\left(1 - s + s\,\frac{E}{E_{TQ}}\right)^{0.25} - 1\right]\frac{\mbox{sone}}{\mbox{Bark}},$$

where \(E_{TQ}\) stands for the excitation level at the threshold in quiet, E is the main excitation level, \(E_0\) is the excitation corresponding to the reference intensity \({I}_{0} = 1{0}^{-12} \frac{\mbox{ W}} {{\mbox{ m}}^{2}}\), s stands for the masking index, and one sone is equivalent to 40 phon. To obtain the main loudness value, the specific loudness values of all critical bands are added. Figure 13.5c depicts the time–loudness evolution in 3D, where the loudness levels are represented along the z-coordinate.
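As a sketch, the specific loudness formula translates directly into code, assuming the excitations are already expressed relative to \(E_0\) (the function names and the band-wise data layout are our choices):

```python
def specific_loudness(E, E_TQ, s):
    """Specific loudness N' in sone/Bark for one critical band.

    E    : main excitation level of the band      (relative to E_0)
    E_TQ : excitation at the threshold in quiet   (relative to E_0)
    s    : masking index
    """
    return 0.068 * (E_TQ / s) ** 0.25 * ((1.0 - s + s * E / E_TQ) ** 0.25 - 1.0)

def main_loudness(E_bands, E_TQ_bands, s_bands):
    """Main loudness: sum of the specific loudness over the 20 critical bands."""
    return sum(specific_loudness(E, Etq, s)
               for E, Etq, s in zip(E_bands, E_TQ_bands, s_bands))
```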

3 Preprocessing of Speech 2D Representations Using the QFT and Quaternionic Gabor Filter

This section presents two methods of preprocessing. The first is simple to formulate; however, we show that the features it extracts do not yield a high classification rate, because the consonant sounds of the phonemes are not well recognized.

3.1 Method 1

A quaternion Gabor filter is used for the preprocessing (see Fig. 13.6a). This filter is convolved, using the quaternion Fourier transform, with an image of 80 × 80 pixels \(\big((5 \times 16\,[\mathrm{channels}]) \times (800/10\,[\mathrm{ms}])\big)\). In order to provide the QFT with an 80 × 80 square matrix, each of the 16 channel rows was copied 5 times, giving 80 rows. We used only the first approximation of the psycho-acoustical model, which comprises only the first 16 channels. The feature extraction is done in the quaternion frequency domain by searching for features along the lines of expected maximum energy (see Fig. 13.6b). After analyzing several images, we found the best placement of these lines in the four component images of the quaternion image. Note that this approach first expands the original 80 × 80 image into a four-component quaternion representation and then reduces it to a feature vector of length 16 [channels] × 8 [lines] = 128. This clearly explains the motivation of our approach: we use a four-dimensional representation of the image in which we search for features along selected lines; as a result, we can effectively reduce the dimension of the feature vector (see the sketch at the end of this subsection). Figure 13.6c shows the feature vectors for the phonemes of the 10 decimal numbers spoken by 29 different female and male speakers. For example, for the word neuf of Fig. 13.6c, we have stacked the feature vectors of the 29 speakers, as Fig. 13.6d shows. The consequence of considering only the first 16 channels was a notable loss of high-frequency components, making the detection of the consonants difficult. We also noticed that with a wide quaternion Gabor filter, even though higher levels of energy were detected, the detection of the consonants did not succeed; conversely, with a narrow filter, we were able to detect the consonants, but the detected information about the vowels was very poor. This indicated that we should filter only those zones where the changes of sound between consonants and vowels take place. The second method, presented next, implements this key idea.
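The following sketch shows how the 128-dimensional vector could be assembled, assuming vertical analysis lines indexed by (component, column) pairs; the QFT and the quaternion Gabor filtering themselves are omitted, and all names, line positions, and the random data are illustrative:

```python
import numpy as np

def method1_features(q_images, lines):
    """Sketch of method 1 feature extraction in the quaternion frequency domain.

    q_images : dict with the four component images 'r', 'i', 'j', 'k' of the
               filtered QFT image, each an 80 x 80 array
    lines    : 8 (component, column) pairs marking the analysis lines of
               expected maximum energy, found empirically
    Returns a feature vector of length 16 channels x 8 lines = 128.
    """
    feats = []
    for comp, col in lines:
        column = np.abs(q_images[comp][:, col])        # 80 values on one line
        # One value per Bark channel: fold the 5 replicated rows back to 16.
        feats.append(column.reshape(16, 5).mean(axis=1))
    return np.concatenate(feats)                        # shape (128,)

# Hypothetical usage with random component images and line positions:
q = {c: np.random.rand(80, 80) for c in "rijk"}
lines = [("r", 10), ("r", 30), ("i", 10), ("i", 30),
         ("j", 10), ("j", 30), ("k", 10), ("k", 30)]
fv = method1_features(q, lines)
```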

Fig. 13.6

Method 1: (from upper left corner) (a) quaternion Gabor filter, (b) selected 128 features according to energy level (16 channels and 8 analysis lines in the four r, i​, j​, k images, 16 ×8 = 128), (c) stack of feature vectors for 10 numbers and 29 speakers (the ordinate axis shows the words of the first 10 French numbers and the abscissa the time in milliseconds), (d) the stack of the French word neuf spoken by 29 speakers (also presented in (c))

3.2 Method 2

The second method does not convolve the whole image with the filter; it uses all 20 channels of the main loudness signal. First, we detect the inflection points where the changes of sound take place, particularly those between consonants and vowels. These inflection points are found by taking the first derivative of the main loudness signal with respect to time (see Fig. 13.7a for the word sept); a sketch of this step is given below. Imagine that someone says ssssseeeeeeepppppth; using the inflection points, one detects two transition regions of 60 ms, one for se and another for ept (see Fig. 13.7b). By filtering these two regions with a narrow quaternion Gabor filter, we split each region into two, separating s from e (first region) and e from pth (second region). The four strips represent what happens before and after the vowel e (see Fig. 13.8a). The feature vector is built by tracing a line through the maximum levels of each strip. We obtain four feature columns: column 1 for s, column 2 for e, column 3 for e, and column 4 for pth (see Fig. 13.8b). Finally, one builds a single feature vector of length 80 by concatenating the four columns (20 features each). Note that the second method reduces the feature vector length even further, from 128 to 80.
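A minimal sketch of the first step, under one plausible reading of the text (inflection points taken as the zero crossings of the loudness derivative; the function name and frame timing are our choices):

```python
import numpy as np

def transition_regions(loudness, frame_ms=10, region_ms=60):
    """Sketch of method 2, step 1: locate the transition regions.

    loudness : 1D main loudness signal (sum of the 20 channels), one value
               per frame_ms milliseconds
    Returns (start, end) frame indices of region_ms-wide windows centered
    on the inflection points of the signal.
    """
    d = np.gradient(loudness)                  # first derivative w.r.t. time
    # Zero crossings of the derivative mark the turning points of the
    # loudness curve, where consonant-vowel changes tend to occur.
    idx = np.where(np.diff(np.sign(d)) != 0)[0]
    half = region_ms // (2 * frame_ms)         # half-width in frames
    return [(max(i - half, 0), i + half) for i in idx]
```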

Fig. 13.7

(a) Method 2: the main loudness signals for sept and neuf spoken by two speakers, (b) determination of the analysis strips using the lines of the inflection points (20 filter responses at the ordinate axis and the time in milliseconds at the abscissa axis), (c) narrow-band quaternion Gabor filter

Fig. 13.8

Method 2: from the left (a) selected strips of the quaternion images for the words sept and neuf spoken by two speakers; (b) zoom of a strip of the component j of the quaternionic image and the selected 4 columns for feature extraction (20 channels × 4 lines = 80 features)

Figure 13.9a shows the feature vectors of length 80 for the phonemes of the 10 decimal numbers spoken by 29 different female or male speakers. For example, for the word neuf of Fig. 13.7b, we have stacked the feature vectors of the 29 speakers, as shown in Fig. 13.9b.

Fig. 13.9

Method 2: (a) stack of feature vectors of 29 words spoken by 29 speakers, (b) the stack for the word neuf spoken by 29 speakers

4 Recognition of French Phonemes Using Neurocomputing

The features extracted using method 1 were used for training a multilayer perceptron (MLP), depicted in Fig. 13.10a. We used a training set of 10 male and 9 female speakers; each one spoke the first 10 French numbers zéro, un, deux, trois, quatre, cinq, six, sept, huit, neuf; thus, the training set comprises 190 samples. After the training, a set of 100 spoken words was used for testing method 1 (10 numbers spoken by 5 male and 5 female speakers, giving 100 samples). The recognition rate achieved was 87%. For the second approach, the features were extracted using method 2. The structure used for recognition consists of an assembly of three neural experts regulated by a neural network arbitrator. In Fig. 13.10b, the arbitrator is called main, and each neural expert is dedicated to recognizing a certain group of spoken words; a sketch of this arrangement follows below. The recognition rate achieved was 98%. This great improvement is due mainly to the preprocessing of method 2 and to the use of a system of neural experts. We carried out a similar test using a set of 150 training samples (10 male and 5 female speakers) and 150 recall samples (5 male and 10 female speakers); the recognition rate achieved was a bit lower, 97%. This may be due to the lower number of female speakers during training and their higher number in the recall set; that is, the system specializes better for samples spoken by males.
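The text does not detail how the arbitrator and the experts interact; the following sketch shows one common gating scheme consistent with Fig. 13.10b (all names and the stand-in linear "networks" are hypothetical):

```python
import numpy as np

def experts_classify(x, experts, arbitrator):
    """Sketch of the neural-experts arrangement of Fig. 13.10b.

    experts    : three networks, each mapping a feature vector to class
                 scores for its own group of spoken words
    arbitrator : the 'main' network, scoring which expert is competent
    Each network is any callable returning a 1D score array.
    """
    k = int(np.argmax(arbitrator(x)))            # pick the competent expert
    return k, int(np.argmax(experts[k](x)))      # classify within its group

# Hypothetical usage with stand-in linear 'networks' on 80-dim features:
rng = np.random.default_rng(0)
net = lambda n_out: (lambda x, W=rng.normal(size=(n_out, 80)): W @ x)
expert_id, word_id = experts_classify(rng.normal(size=80),
                                      [net(4), net(3), net(3)], net(3))
```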

Fig. 13.10

(a) Neural network used for method 1, (b) group of neural networks used for method 2

In order to compare with a standard method used in speech processing, we resorted to the time-delay neural network (TDNN) [136], using the Stuttgart Neural Network Simulator (SNNS). The input data in the format required by the SNNS were generated using Matlab. The selected TDNN architecture was as follows: (input layer) 20 inputs for the 20 features that code the preprocessed spoken numbers, 4 delays of length 4; (hidden layer) 10 units, 2 delays of length 2; (output layer) 10 units for the 10 different numbers. We trained the TDNN using 1,500 samples, 150 for each spoken number, with the SNNS learning function timedelaybackprop and the update function Timedelay order. During the learning, we carried out 1,000 cycles, that is, 1,000 iterations for each spoken number. We trained the neural network in two ways: (i) one TDNN with 10 outputs; and (ii) a set of three TDNNs, where each network was devoted to learning a small disjoint set of spoken numbers. Of the two, the best result was obtained with the single TDNN, which achieved a recognition rate of 93.8%. Table 13.2 summarizes the test results; the letters F and M stand for female and male speakers, respectively. A sketch of the time-delay mechanism follows.
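For readers unfamiliar with TDNNs, this numpy sketch shows the essential time-delay mechanism (windows of delayed frames feeding each unit); it mirrors the layer sizes quoted above but is not the SNNS implementation, and all weights are random placeholders:

```python
import numpy as np

def tdnn_layer(x, W, b, delay):
    """One time-delay layer: output frame t sees input frames t..t+delay-1.

    x : (T, n_in) sequence;  W : (n_out, delay * n_in);  b : (n_out,)
    """
    T = x.shape[0] - delay + 1
    windows = np.stack([x[t:t + delay].ravel() for t in range(T)])
    return np.tanh(windows @ W.T + b)

# Dimensions echoing the text: 20 inputs with delay 4 -> 10 hidden units
# with delay 2 -> 10 outputs (one per spoken number).
rng = np.random.default_rng(1)
x = rng.normal(size=(12, 20))                      # 12 frames of 20 features
h = tdnn_layer(x, rng.normal(size=(10, 80)), np.zeros(10), delay=4)
y = tdnn_layer(h, rng.normal(size=(10, 20)), np.zeros(10), delay=2)
scores = y.mean(axis=0)                            # integrate over time
```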

Table 13.2 shows that the best architecture was the one composed of a set of neural experts (method 2). Clearly, the TDNN performed better than the MLP. Taking into account the performance of method 2 and of the TDNN, we find that the preprocessing using the quaternion Fourier transform played a major role.

Table 13.2 Comparison of the methods (M = male and F = female)

5 Application of QWT

Motion estimation using quaternionic wavelet filters is based on measuring the phase changes of the filter outputs. The accuracy of such an estimation depends on how well our algorithm deals with the correspondence problem, which can be seen as a generalization of the aperture problem depicted in Fig. 13.11.

Fig. 13.11

The aperture problem

This section deals with the estimation of the optical flow in terms of the estimation of the image disparity. The disparity of a pair of images \(f_1(x,y)\) and \(f_2(x,y)\) is computed by determining the local displacement that satisfies \(f_1(x,y) = f_2(x + d_x, y + d_y) = f_2(\mathbf{x} + \mathbf{d})\), where \(\mathbf{d} = (d_x, d_y)\). The range of \(\mathbf{d}\) has to be small compared with the image size; thus, the observed features always have to lie within a small neighborhood in both images. In order to estimate the optical flow using the quaternionic wavelet pyramid, we first compute the quaternionic phase at each level, then the confidence measure, and finally the optical flow itself. The first two steps are explained in the next sections; thereafter, we show two examples of optical flow estimation.

5.1 Estimation of the Quaternionic Phase

The estimation of the disparity using the concept of local phase begins with the assumption that a couple of successive images are related as follows:

$$f_1(\mathbf{x}) = f_2(\mathbf{x} + \mathbf{d}(\mathbf{x})),$$
(13.2)

where d(x) is the unknown displacement vector. Assuming that the phase varies linearly (hence the importance of shift-invariant filters), the displacement d(x) can be computed as

$$d_x(\mathbf{x}) = \frac{\phi_2(\mathbf{x}) - \phi_1(\mathbf{x}) + 2\pi n + k\pi}{2\pi u_{\mathrm{ref}}},\qquad d_y(\mathbf{x}) = \frac{\theta_2(\mathbf{x}) - \theta_1(\mathbf{x}) + m\pi}{2\pi v_{\mathrm{ref}}},$$
(13.3)

with reference frequencies \((u_{\rm ref}, v_{\rm ref})\) that are not known a priori. Here, ϕ(x) and θ(x) are the first two components of the quaternionic local phase of the quaternionic filter response. We choose \(n,m \in \mathbb{Z}\) so that \(d_x\) and \(d_y\) lie within a valid range. Depending on m, k is defined as

$$k = \left\{\begin{array}{ll} 0, & \mathrm{if}\ m\ \mathrm{is\ even},\\ 1, & \mathrm{if}\ m\ \mathrm{is\ odd}. \end{array}\right.$$
(13.4)

A good disparity estimation is achieved if \((u_{\rm ref}, v_{\rm ref})\) are chosen well. There are two ways of dealing with this problem: (i) the constant model, where \(u_{\rm ref}\) and \(v_{\rm ref}\) are chosen as the central frequencies of the filters; (ii) the so-called local model, known from the complex case, which supposes that the phase takes the same value \(\Phi_1(\mathbf{x}) = \Phi_2(\mathbf{x} + \mathbf{d})\) at two corresponding points of the two images. Thus, one estimates d by approximating \(\Phi_2\) via a first-order Taylor series expansion about x:

$$\Phi_2(\mathbf{x} + \mathbf{d}) \approx \Phi_2(\mathbf{x}) + (\mathbf{d}\cdot\nabla)\,\Phi_2(\mathbf{x}),$$
(13.5)

where we write Φ = (ϕ, θ). Solving Eq. 13.5 for d, we obtain the estimated disparity of the local model (the solve step is made explicit after Eq. 13.6). In our experiments, we assume that ϕ varies along the x-direction and θ along y. Under this assumption, the disparity (Eq. 13.3) can be estimated using the following reference frequencies:

$$u_{\mathrm{ref}} = \frac{1}{2\pi}\,\frac{\partial \phi_1}{\partial x}(\mathbf{x}),\qquad v_{\mathrm{ref}} = \frac{1}{2\pi}\,\frac{\partial \theta_1}{\partial y}(\mathbf{x}).$$
(13.6)
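Making the solve step of Eq. 13.5 explicit (our sketch, writing the Jacobian of \(\Phi_2 = (\phi_2, \theta_2)\) as J):

$$\Phi_1(\mathbf{x}) - \Phi_2(\mathbf{x}) \approx J(\mathbf{x})\,\mathbf{d},\qquad J = \begin{pmatrix} \partial_x\phi_2 & \partial_y\phi_2\\ \partial_x\theta_2 & \partial_y\theta_2 \end{pmatrix},\qquad \mathbf{d} \approx J^{-1}\big(\Phi_1(\mathbf{x}) - \Phi_2(\mathbf{x})\big).$$

Under the assumption above (ϕ varying only along x and θ only along y), J is diagonal, which reduces to Eq. 13.3 with the reference frequencies of Eq. 13.6.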

At the locations where \(u_{\rm ref}\) and \(v_{\rm ref}\) are equal to zero, Eq. 13.6 is undefined. One can exclude these locations using a binary confidence mask, as explained in the next section and sketched below.
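A numpy sketch of the resulting disparity computation, choosing n and m so that the phase differences fall in the principal interval and masking the undefined locations (function names and the eps threshold are our choices):

```python
import numpy as np

def local_model_refs(phi1, theta1):
    """Reference frequencies of Eq. 13.6 via finite differences."""
    u_ref = np.gradient(phi1, axis=1) / (2 * np.pi)    # (1/2pi) d(phi)/dx
    v_ref = np.gradient(theta1, axis=0) / (2 * np.pi)  # (1/2pi) d(theta)/dy
    return u_ref, v_ref

def disparity_from_phase(phi1, phi2, theta1, theta2, u_ref, v_ref, eps=1e-6):
    """Phase-based disparity (Eq. 13.3) with principal-interval wrapping."""
    dphi = np.angle(np.exp(1j * (phi2 - phi1)))        # wrap to (-pi, pi]
    dth = np.angle(np.exp(1j * (theta2 - theta1)))
    valid = (np.abs(u_ref) > eps) & (np.abs(v_ref) > eps)  # see Sect. 13.5.2
    dx = np.where(valid, dphi, 0.0) / np.where(valid, 2 * np.pi * u_ref, 1.0)
    dy = np.where(valid, dth, 0.0) / np.where(valid, 2 * np.pi * v_ref, 1.0)
    return dx, dy, valid
```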

5.2 Confidence Interval

In neighborhoods where the energy of the filtered image is low, one cannot estimate the local phase reliably; similarly, at points where the local frequency is zero, the estimation of the disparity is impossible, because the disparity is computed as the phase difference divided by the local frequency. We therefore need a confidence measure that indicates the quality of the estimation at a given point. For this purpose, we can design a simple binary confidence mask. For complex filters, this is done by checking whether the magnitude of the filter response is reasonably large.

In the case of quaternionic filters, we need two confidence measures, one for ϕ and one for θ. According to the multiplication rules of quaternions, the first two components of the quaternionic phase are defined at almost every point as

$$\phi(\mathbf{x}) = \arg_i\!\big(k^q(\mathbf{x})\,\beta(k^q(\mathbf{x}))\big),\qquad \theta(\mathbf{x}) = \arg_j\!\big(\alpha(k^q(\mathbf{x}))\,k^q(\mathbf{x})\big),$$
(13.7)

where \(k^q\) is the quaternionic filter response, and \(\arg_i\) and \(\arg_j\) were defined in Eq. 3.34. The projections of the quaternionic filter response are computed according to Eq. 3.35. Using these angles and projections, we can now extend the well-known confidence measure for complex filters,

$$\mathrm{Conf}(\mathbf{x}) = \left\{\begin{array}{ll} 1, & \mathrm{if}\ |k(\mathbf{x})| > \tau,\\ 0, & \mathrm{otherwise}, \end{array}\right.$$
(13.8)

to the quaternionic case:

$$C_h(k^q(\mathbf{x})) = \left\{\begin{array}{ll} 1, & \mathrm{if}\ \mathrm{mod}_i\big(k^q(\mathbf{x})\,\beta(k^q(\mathbf{x}))\big) > \tau,\\ 0, & \mathrm{otherwise}, \end{array}\right.$$
(13.9)
$$C_v(k^q(\mathbf{x})) = \left\{\begin{array}{ll} 1, & \mathrm{if}\ \mathrm{mod}_j\big(\alpha(k^q(\mathbf{x}))\,k^q(\mathbf{x})\big) > \tau,\\ 0, & \mathrm{otherwise}. \end{array}\right.$$
(13.10)

Given the outputs of the quaternionic filters \({k}_{1}^{q}\) and \({k}_{2}^{q}\) of two images, we can implement the following confidence masks:

$$\mathrm{{Conf}}_{h}(\mathbf{x}) = {C}_{h}({k}_{1}^{q}(\mathbf{x})){C}_{ h}({k}_{2}^{q}(\mathbf{x})),$$
(13.11)
$$\mathrm{{Conf}}_{v}(\mathbf{x}) = {C}_{v}({k}_{1}^{q}(\mathbf{x})){C}_{ v}({k}_{2}^{q}(\mathbf{x})),$$
(13.12)

where Conf h and Conf v are the confidence measurements for the horizontal and vertical disparity, respectively.

However, one cannot use these as two independent measurements, because they are fully identical. This can easily be seen from the following identity, valid for any quaternion q:

$$\mathrm{{mod}}_{i}(q\beta (\bar{q})) =\mathrm{{ mod}}_{j}(\alpha (\bar{q})q),$$
(13.13)

which can be checked straightforwardly by writing out both sides of the equation. Using the previous formulas, we conclude that, for the responses of the two quaternionic filters on either image, the horizontal and vertical confidence measures are identical:

$$\mathrm{{Conf}}_{h}(\mathbf{x}) =\mathrm{{ Conf}}_{v}(\mathbf{x}).$$
(13.14)
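Since the horizontal and vertical measures coincide, a single magnitude-based mask suffices in practice; a minimal sketch (τ and the function name are our choices):

```python
import numpy as np

def confidence_mask(k1_mag, k2_mag, tau):
    """Binary confidence mask in the spirit of Eqs. 13.8-13.12.

    k1_mag, k2_mag : magnitude images of the quaternionic filter responses
    of the two frames (mod_i and mod_j coincide by Eq. 13.13, so one
    magnitude per frame is enough).
    """
    return (k1_mag > tau) & (k2_mag > tau)   # Conf_h == Conf_v (Eq. 13.14)
```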

For a more detailed explanation of the quaternionic confidence interval, the reader can resort to the Ph.D. thesis of Bülow [29].

5.3 Discussion on Similarity Distance and the Phase Concept

In the case of the complex conjugated wavelet analysis, the similarity distance \(S_j((x,y),(x',y'))\) for any pair of image points on the reference image f(x, y) and the matched image (after a small motion) f′(x′, y′) is defined using the six differential components \({D}_{j,p},\tilde{{D}}_{j,p},\,p = 1,2,3\), of Eq. 8.76. The best match is achieved by finding u, v that minimize

$$\min_{u,v}\, S_j((k,l),(k' + u,\, l' + v)).$$
(13.15)

The authors of [144, 127] show that, with the continuous interpolation of Eq. 13.15, one obtains a quadratic equation for the similarity distance:

$$S_{j,p}((k,l),(k' + u,\, l' + v)) = s_1(u - u_0)^2 + s_2(v - v_0)^2 + s_3(u - u_0)(v - v_0) + s_4,$$
(13.16)

where \((u_0, v_0)\) is the minimum point of the similarity distance surface \(S_{j,p}\), the coefficients \(s_1, s_2, s_3\) describe the curvature of the surface, and \(s_4\) is the minimum value of the similarity distance \(S_j\). The parameters of this approximate quadratic surface provide a sub-pixel-accurate motion estimate and an accompanying confidence measure. Through its use in the complex-valued discrete wavelet transform hierarchy, the authors claim to handle the aperture problem successfully.

In the case of the quaternion wavelet transform, we directly use the motion information captured in the three phases of the detail filters. The approach is linear, owing to the linearity of the polar representation of the quaternion filter. The confidence throughout the pyramid levels is ensured by the bilinear confidence measure of Eqs. 13.11 and 13.12, and the motion is computed by the linear evaluation of the disparity equations (13.3). The local model helps to estimate \(u_{\rm ref}\) and \(v_{\rm ref}\) by evaluating Eq. 13.6. The estimation is not of a quadratic nature like Eq. 13.16; an extension to this kind of quadratic motion estimation constitutes an avenue for further improvement of the QWT for multiresolution analysis.

Fleet [59] claims that phase-based disparity estimation is limited to estimating the component of the disparity vector normal to an oriented structure in the image, the underlying assumption being that the image is intrinsically one-dimensional almost everywhere. In contrast, the quaternion confidence measure (see Eq. 13.12) singles out those regions where the horizontal and vertical displacements can be estimated reliably and simultaneously. Thus, by using the quaternionic phase concept, the full displacement vectors are evaluated locally at those points where the aperture problem can be circumvented.

5.4 Optical Flow Estimation

In this section, we show the estimation of the optical flow for the Rubik's cube and Hamburg taxi image sequences. We used the following quaternionic scaling and wavelet filters:

$${h}^{q} = g(x,{\sigma }_{ 1},\epsilon )\exp \left (\mathbf{\mathit{i}}\frac{{c}_{1}{\omega }_{1}x} {{\sigma }_{1}} \right )\exp \left (\mathbf{\mathit{j}}\frac{{c}_{2}\epsilon {\omega }_{2}y} {{\sigma }_{1}} \right ),$$
(13.17)
$${g}^{q} = g(x,{\sigma }_{ 2},\epsilon )\exp \left (\mathbf{\mathit{i}}\frac{\tilde{{c}}_{1}\tilde{{\omega }}_{1}x} {{\sigma }_{2}} \right )\exp \left (\mathbf{\mathit{j}}\frac{\tilde{{c}}_{2}\epsilon \tilde{{\omega }}_{2}y} {{\sigma }_{2}} \right ),$$
(13.18)

with \({\sigma }_{1} = \frac{\pi } {6}\) and \({\sigma }_{2} = \frac{5\pi } {6}\), so that the filters are in quadrature, and \({c}_{1} =\tilde{ {c}}_{1} = 3\), ω1 = ω2 = 1, and ε = 1. The resulting quaternionic mask is subsampled through the levels of the pyramid. For the estimation of the optical flow, we use two successive images of the sequence; thus, two quaternionic wavelet pyramids are generated. For our examples, we computed four levels. According to Eq. 8.88, at each level of each pyramid we obtain 16 images, accounting for the 4 quaternionic outputs (the approximation Φ and the details Ψ 1 (horizontal), Ψ 2 (vertical), Ψ 3 (diagonal)). The phases are evaluated according to Eqs. 3.4. Figure 13.13 shows the magnitudes and phases obtained at level j for two successive Rubik's cube images.
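Before moving on, here is a sketch of how the four real components of \(h^q\) can be generated, using \(e^{iA}e^{jB} = (\cos A + i\sin A)(\cos B + j\sin B)\); the Gaussian window g and the mask size are assumptions, since the text does not spell them out, and \(c_2 = 3\) is assumed equal to \(c_1\):

```python
import numpy as np

def quaternion_gabor(size, sigma, c1, c2, w1, w2, eps=1.0):
    """Four real components of the quaternionic filter h^q of Eq. 13.17.

    Uses exp(iA)exp(jB) = (cos A + i sin A)(cos B + j sin B), i.e.
    r = g cosA cosB, i = g sinA cosB, j = g cosA sinB, k = g sinA sinB.
    The Gaussian window g is an assumption; the text does not define it.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    g = np.exp(-(x**2 + (eps * y) ** 2) / (2 * sigma**2))
    A = c1 * w1 * x / sigma
    B = c2 * eps * w2 * y / sigma
    return (g * np.cos(A) * np.cos(B), g * np.sin(A) * np.cos(B),
            g * np.cos(A) * np.sin(B), g * np.sin(A) * np.sin(B))

# Parameters from the text: sigma_1 = pi/6, c1 = 3 (c2 assumed equal),
# w1 = w2 = 1, eps = 1; the 15 x 15 mask size is our choice.
hq_r, hq_i, hq_j, hq_k = quaternion_gabor(15, np.pi / 6, 3, 3, 1, 1)
```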

Fig. 13.13

The magnitudes and phase images for Rubik’s sequence at a certain level j: (upper row) the approximation Φ and (next rows) the details Ψ 1 (horizontal), Ψ 2 (vertical), Ψ 3 (diagonal)

After computing the phases, we estimate the disparity images using Eq. 13.3, where the reference frequencies \(u_{\rm ref}\) and \(v_{\rm ref}\) are calculated according to Eq. 13.6. We apply the confidence mask according to the guidelines given in Sect. 13.5.2, as shown in Fig. 13.14a.

Fig. 13.14

(a) Confidence mask, (b) estimated optical flow

After the disparity estimate has been filtered by the confidence mask, we estimate the optical flow at each point, computing a velocity vector in terms of the horizontal and vertical details; the information from the diagonal detail is then used to adjust the final orientation of the velocity vector. Since the procedure starts from the highest level (top-down), the resulting matrix of optical flow vectors is expanded to the size of the next level, as shown for one level in Fig. 13.12. The algorithm estimates the optical flow at the new level, and the result is compared with that of the expanded previous level; the velocity vectors of the previous level fill gaps in the new level.

Fig. 13.12

Procedure for the estimation of the disparity at one level

This procedure continues down to the bottom level. In this way, the estimation is refined smoothly, and well-defined optical flow vectors are passed from level to level, increasing the confidence of the vectors at the finest level. It is unavoidable that some artifacts survive to the final stage; a final refinement can be applied by thresholding the magnitudes and deleting isolated small vectors. Figure 13.14b presents the computed optical flow for an image pair of the Rubik's cube sequence. A sketch of the level-to-level fusion follows.
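A sketch of the per-level fusion just described, assuming flow fields stored as (H, W, 2) arrays, nearest-neighbor expansion, and the usual factor-2 rescaling of displacements between levels (all names are our choices):

```python
import numpy as np

def expand_flow(flow):
    """Nearest-neighbor expansion to the next finer level, with the usual
    factor-2 rescaling of the displacements."""
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)

def fuse_levels(flow_coarse, flow_fine, conf_fine):
    """One top-down fusion step: confident estimates of the current level
    win; expanded vectors from the coarser level fill the gaps.

    flow_* : (H, W, 2) flow fields;  conf_fine : (H, W) boolean mask
    """
    return np.where(conf_fine[..., None], flow_fine, flow_coarse)
```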

Next, we present the computation of the optical flow of the Hamburg taxi sequence using the QWT. Figure 13.15a shows a pair of successive images; Fig. 13.15b shows the confidence matrices at the four levels; and Fig. 13.15c presents the fused horizontal and vertical disparities and the resulting optical flow at a high level of resolution. For comparison, we used the method of Bernard [23], which is based on real-valued discrete wavelets; Fig. 13.16 shows that our method yields better results. Based on these results, we conclude that our procedure, using a quaternionic wavelet pyramid and the phase concept for parameter estimation, works very well in both experiments. We believe that the computation of the optical flow using the quaternionic phase concept should be considered by researchers and practitioners as an effective alternative for multiresolution analysis.

Fig. 13.15

(a) (top row) A couple of successive images, (b) (middle row) confidence matrices at four levels, (c) (bottom) horizontal and vertical disparities and optical flow

Fig. 13.16

Optical flow computed using discrete real-valued wavelets. (Top) Optical flow at fourth level. (Bottom) Optical flow at third, second, and first levels

6 Conclusion

In the first section, the chapter showed the application of feature detectors using Lie perceptrons in the 2D affine plane. The filters are not only applicable to optical flow; they may also be used for the detection of key points. Combinations of Lie operators lead to more complex structures for detecting more sophisticated visual geometry.

The second part presented the preprocessing for neurocomputing using the quaternion Fourier transform. We applied this technique to the recognition of the first 10 French numbers spoken by different speakers. In our approach, we expand the signal representation to a higher-dimensional space, in which we search for features along selected lines; this allows us to reduce the dimensionality of the feature vector substantially. The results also show that the method manages to separate the sounds of vowels and consonants, which is quite rare in the speech-processing literature. For the recognition, we used a neural-experts architecture.

The third part of this chapter introduced the theory and practicalities of the QWT, so that the reader can apply it to a variety of problems making use of the quaternionic phase concept. We extended Mallat's multiresolution analysis using quaternion wavelets; these kernels are more efficient than the Haar quaternion wavelets. A big advantage of our approach is that it offers three phases at each level of the pyramid, which can be used for a powerful top-down parameter estimation. As an illustration, in the experimental part we applied the QWT to optical flow estimation.

We believe that this chapter can be very useful for researchers and practitioners interested in understanding and applying the quaternion Fourier and wavelet transforms.