1 Introduction

Obtaining image sequences for mapping has become an easier task over the last years, thanks to the rapid progress on optical sensors and robotic platforms. One of the steps in the processing pipeline of image sequences (e.g., mosaicing Xu 2013; Fang et al. 2010 and 3D reconstruction Koutsoudis et al. 2014) is usually 2D image registration, which is the process of stitching two or more views of the same scene taken from different viewpoints. It is one of the crucial steps and plays a very important role in a vast amount of computer vision processes. Image registration is mainly accomplished by using either feature or featureless methods. Over the last decade, impressive progress (such as Scale Invariant Feature Transform (SIFT) (Lowe 2004), Speeded Up Robust Features (SURF) (Bay et al. 2006, etc.) has been made on detecting and extracting distinctive salient points (also known as features) in the image, which leads to foster and promote the usage of feature-based methods more than the featureless intensity based methods. Feature-based methods follow a pipeline that is composed of (1) feature detection, (2) feature description and (3) descriptor matching. The matching of features is commonly done by computing a function that depends on the Euclidean distance between their descriptors. The initial matching frequently produces some incorrect correspondences, which are called outliers. Outliers are typically identified and removed with a robust estimation algorithm (e.g., RANSAC Fischler and Bolles 1981). These probabilistic methods might fail when the scene presents Visual Aliasing, that is highly repetitive textures and structural similarities. This failure often leads to mismatched image pairs (e.g., Fig. 1), although the resulting homography accuracy is within the error bounds handled by RANSAC. While processing such a sequence of images with repetitive texture, the probability of occurrence of such cases becomes high and these mismatched image pairs provide misleading information about the camera trajectory and prevents an accurate final outcome.

Fig. 1
figure 1

An example of mismatched image pair due to visual aliasing. Twenty-eight inliers are detected and the corresponding homography is computed using DLT (Hartley and Zisserman 2004) with an average error of 3.41 pixels in one image. Note that the image hardly overlap

In this paper, we present a method to identify such image pairs, using loop constraints. Our pipeline for identifying mismatched image pairs embedded in FIM framework is illustrated in Fig. 2. Our method relies on the fact that images forming a loop (a cycle in graph-based notation) should have identity mapping when all the homographies are multiplied (see Fig. 3). If there is a mismatched image pair or pairs in a cycle, then the error (computed as a deviation from identity) will grow drastically. Errors computed by using a different number of cycles involving the same image pair can be used to indicate a false match positive for that image pair.

Fig. 2
figure 2

Pipeline of the proposed mismatched image pairs identification method embedded into a traditional FIM framework

Fig. 3
figure 3

An illustrative example of a cycle showing a closed-loop. The loop should the following equality: \(\mathbf I ={^{1}\mathbf H _{2}}\cdot {^{2}\mathbf H _{3}}\cdot {^{3}\mathbf H _{4}}\cdot {^{4}\mathbf H _{5}}\cdot {^{5}\mathbf H _{1}}\). However, this equality does not often hold due to error accumulation. The longer the loop, the more error it accumulates. When there is a wrong edge in a cycle (resulting from an image matching wrongly accepted as valid), then effect on the error is much large than the effect of accumulation

Our algorithm starts by finding cycles (loops) in the sequence. Next, errors are computed for each pair inside the cycles (see Table 1). To have meaningful descriptive statistics, this process is repeated several times with different cycles. Cycles are obtained by generating different cycles bases. After obtaining a series of errors for each image pair, we compute error histograms and try to identify the different ones using a previously trained classifier. To validate our proposal, we present experimental results with artificially added mismatched image pairs on underwater image sequences as well as real sequences from panoramic image sequences.

Finding such mismatched pairs has been studied in the global alignment literature by iteratively removing the image pair with the highest residual error (Martinec and Pajdla 2007). Although this might be feasible for small and/or sparser datasets, repeating the global alignment may lead to a high computational cost. The concept of ”Missing correspondences” for an image pair using a third image was proposed in Zach et al. (2008) to identify mismatched image pairs. A similar solution was proposed using this missing correspondences analysis and timestamps of images in Roberts et al. (2011). This approach is only feasible for sequentially acquired image sequences due to the use of timestamps. Our method shares some similarities and builds upon the method proposed by Zach et al. (2010). The main differences are that we do not set any limit on the length (number of vertices) of a cycle and we do not use any empirical values for the parameters of the error distributions of image pairs. Instead, we employ a machine learning technique, operating on error histograms to identify mismatched image pairs.

The rest of the paper is structured as follows. The following section provides a brief background about graph-based topology representation and details our proposal. Experimental results are illustrated and discussed in Sect. 3. Finally, we present our conclusions in Sect. 4.

2 Mismatch identification using cycles in graph-based topology representation

A graph \(G\,=\,(V,E)\) consists of vertices (nodes) and edges (links) between vertices. V is the set of vertices while \(E \subset V\times V\) represents the set of edges. The total number of vertices \(n\,=\,\left| {V}\right|\) defines the order of the graph while the total number of edges \(m\,=\,\left| {E}\right|\) is the size of the graph (Skiena 1990). A cycle in a graph is defined as a subgraph in which every vertex has even degree, which is defined as the total number of edges incident to the vertex. A cycle basis is a list of cycles in the graph, with each cycle expressed as a list of vertices (Kavitha et al. 2009). These cycles define a basis for the cycle space of the graph so that every other cycle in the graph can be obtained from the cycle basis using only symmetric differences, which is defined as the set of elements that are in either of the sets but not in their intersection. Commonly used methods to compute cycle basis are based on spanning trees. The spanning tree of a connected graph is a tree that connects all the vertices together (Skiena 1990). One graph can have various different spanning trees. The Minimum Spanning Tree (MST) is a spanning tree whose edges have a total weight less than or equal to the total weight of every other spanning tree of the graph. The cycle basis obtained by using the MST is called the minimum weight spanning tree basis, which is computed by adding an edge to the MST and removing the path connecting the endpoints of the edge. The total number of cycles in the basis for a connected graph is \(t_c\,=\,\left( {m} - ({n} - 1)\right)\). If the graph is not connectedFootnote 1 then total number of cycles in the basis is computed as \(\left( m-\, (n\, -\, 1)\, + \,(p\, -\, 1)\right)\), where p is the number of connected components. Every cycle that is not in the basis can be written as a linear combination of two or more cycles in the basis.

We use graph-based topology representation (Sawhney et al. 1998) for our problem where images are denoted as vertices and successfully registered images are connected with edges as this representation enables us to use some existing graph theory algorithms for finding cycles. In this work, having at least 20 correspondences after rejecting the outliers using RANSAC (Fischler and Bolles 1981) is considered as a “successful registration”. As the outcome of the image registration process are two sets of coordinates of correspondences in the images, a planar transformation matrix \(\mathbf H\) (homography) can be computed for each edge in the graph.

Cycles represent closed-loops in our context. When the relative homographies for each edge in the cycle are multiplied consecutively, one could expect to obtain the identity mapping. In practice this does not hold due to error accumulation. For each edge in the cycle, an equation can be written by applying circular shift operations. Equation 1 shows the possible equalities for the example cycle in Fig. 3. These equations allow for computing errors for individual edges. If there is no wrong edge in the cycle, then all individual errors should be relatively small. However, if there is a wrong edge in the cycle, the error will increase drastically for all the edges in the cycle. This might be sufficient to identify cycles with wrong edges but might not be fully suitable to pinpoint which one of the edges is wrong in the cycle. On the other hand, one edge can appear in more than one cycle, e.g., the ones in the MST. And also, cycles can be composed of different number of edges. The longer the cycle, the more error it accumulates. This might cause some inaccurate interpretations of the magnitude of the errors on cycles and edges.

To address all aforementioned issues, we propose to use error statistics on edges. We generate several MSTs using randomly-generated edge weights. For each MST, we find the corresponding cycle basis. Then, for each edge in cycles in the basis, the error is computed by using circular shift as in the Eq. 1. The outcome of this process is a list of errors for each edge in the graph. We generate an error histogram for each edge using the list of errors. If there is no wrong edge in a cycle, the error is not likely to be as high as in the cases where there is a wrong edge. Therefore, correct edges might have small errors mostly and their errors must be varying in a relatively large interval due to the appearance of wrong edges. This leads to the conclusion that the shape of the error histogram for a correct edge might be seen similar to an exponential distribution. A real example of error histograms can be seen in Fig. 4. Since the wrong edges are the main sources of error, regardless of the properties of the cycles they are involved in, they are more likely to have bigger errors. This results in the shape of the histogram being different than that of the exponential distribution. By comparing the shape of the histograms, wrong edges can be identified.

Fig. 4
figure 4

Sample Normalized Error Histogram Curves for Memorial Church Dataset. [Left] Error Histogram Curve for the edge between images 28 and 35. From the curve, it can be said that this pair is mistmatched as its error histogram is mostly composed of large errors. This means error for this edge was large for all the cycles it involved in. [Right] Error Histogram Curve for the edge between images 1 and 2. From the curve, it can be concluded that this pair is a correct edge as it has mostly smaller errors

In order to automate this process, we use supervised classification using the AdaBoostM1 algorithm (Freund and Schapire 1996) operating on error histograms.

(1)

3 Experimental results

To validate our proposal, we present experiments with different datasets from two main application domains of image mosaicing, namely panoramic imaging and optical mapping with unmanned platforms. For all experiments, we used 4-Degree-of-Freedom (DOF) similarity type planar transformations as they generally contain enough DOFs for optical mapping with robotic platforms. SIFT is used for feature detection and matching while RANSAC is employed for outlier rejection and transformation computation. Although we did some experiments with a different set of parameters, we used 250 randomly-generated cycle bases and 50 bins for error histograms. As a Global Alignment (GA) method, we use Symmetric Transfer Error Minimization (STEMin) over absolute homographies. The Symmetric Transfer Error (STE) is defined in Eq. 2:

$$\begin{aligned} \varepsilon&=\sum \limits _{k,t}\sum \limits _{j=1}^{s} \big (\parallel {^{k}\mathbf x _{j}}-{^{1}\mathbf H _{k}^{-1}}\cdot {^{1}\mathbf H _{t}} \cdot {^{t}\mathbf x _{j}} \parallel _2\\&\qquad \qquad + \parallel {}{^{t}\mathbf x _{j}}-{^{1} \mathbf H _{t}^{-1}} \cdot {^{1}\mathbf H _{k}} \cdot {^{k}\mathbf x _{j}} \parallel _2 \big ) \end{aligned}$$
(2)

where k and t are image indices that were successfully matched, s is the total number of correspondences between the overlapping image pairs, and \(({^{1}\mathbf H _{k}}, {^{1}\mathbf H _{t}})\) are the absolute transformation parameters for images k and t, respectively. This error term is minimizing the Euclidean distance between correspondences in each image frame. The first image frame is chosen as a global frame and fixed. To train a classifier, we used four different datasets obtained with unmanned underwater vehicles carrying a down-looking camera during typical seafloor surveys. The main characteristics of the datasets are summarized in Table 1.

Table 1 Main characteristics of training data

We added wrong edges between randomly selected nodes while the total number of wrong edges are controlled and gradually increased starting from 1 to max 15–20% of the total number of overlapping image pairs in the dataset. Summing up all 4 datasets, the final data for training a classifier is composed of 683,606 correct and 53,774 wrong edges.

The first dataset we used for testing is from a video of the Memorial Church in Stanford University Campus. This image sequence Footnote 2 corresponds to the dataset used in Davis (1998). The total number of images of \(192\,\times \,200\) pixels is 145 and the total number of overlapping image pairs is 2997.

We run STEMin using all overlapping image pairs and computed STE using all 261,804 correspondences. Mean STE is computed as 2.97 pixels, standard deviation is 1.89 while the maximum error is 344.37 pixels. These numbers suggest that there might be some mismatched image pairs and/or some outliers among the correspondences. We run our proposal and four image pairs were identified as wrong edges. After removing these image pairs, we run STEMin again and mean error is 2.89, standard deviation is 1.08 and maximum error is found as 25.34 computed using 261,678 correspondences. Furthermore, we examined these 4 image pairs; one of them is a mismatched pair (illustrated in Fig. 5) and the others have an overlapping area but they contain some outliers. The second dataset was acquired with a camera translating over a close range of a scene mostly composed of high color and texture containing books and boxes. The trajectory is a single loop-closing, drawing a smoothed rectangular shape. The dataset has 108 images of \(640\,\times \,480\) pixels with 1827 overlapping image pairs. Mean STE computed over 638,357 is 8.14 pixels. Standard deviation is 4.55 while the maximum error is 1263.67 pixels. Again such high maximum error suggests some outliers and/or mismatched image pairs. We run our proposal and eight overlapping image pairs were identified as mismatched. After removing those image pairs, we run STEMin. Mean STE computed over 638,093 was 8.10 pixels. Standard deviation has reduced to 3.67 pixels while maximum error was reduced to 69.46 pixels. We can infer that our proposal was able to remove the mismatched pairs and/or image pairs with outliers. For this dataset, we examined the removed 8 overlapping image pairs and there were no mismatched pairs but they were all containing some outlier(s). One of the removed image pairs is illustrated in Fig. 6.

Fig. 5
figure 5

Identified mismatched image pair for the Memorial Church Dataset. [Left] Figure shows the final trajectory and footprints of images 28 and 35 as rectangles. [Right] Images and correspondences. Since the scene has repetitive textures and high structural similarity, the image matching algorithm has produced a list of correspondences

Fig. 6
figure 6

One of the identified image pair for the second dataset.[Left] Figure shows the final trajectory and footprints of images 101 and 91 as rectangles. [Right] Images and correspondences. There is a single outlier in this case

The third dataset is a video of the S. Zeno Cathedral in VeronaFootnote 3 (Marzotto et al. 2004), similar to the first dataset. It is composed of 33 images of \(240\times 282\) pixels having 368 overlapping image pairs. We run STEMin and mean error, standard deviation and maximum error are computed as 1.83, 0.71 and 7.08 respectively. We run our method to see whether there was any problematic image pairs or not. The classifier returned an empty set, which could be predicted by looking at the error values. Then we randomly added a few wrong edges to the dataset and run again our proposal. We repeated this procedure 100 times. Although we have seen from the previous experiments that the total number of problematic image pairs is usually low considering the total number of overlapping image pairs, we added wrong edges up to 8–10% of the total number of overlapping image pairs during random trials. Table 2 summarizes the obtained results in a confusion matrix form. The total number of added wrong edges is 1365 while the total number of correct edges is \(100\times 368\). The total number of overlapping image pairs per image is illustrated in Fig. 7.

Fig. 7
figure 7

Overlapping image pair per image for the Szeno dataset. Please note that overlapping image pairs are counted for both images (e.g., The overlapping pair between image a and b is counted for both images)

Table 2 Confusion matrix for 100 random trials on the tested datasets

From the confusion matrix, it can be seen that only one wrong edge was classified as correct by our proposal and 99 correct edges were also identified as wrong. In order to see the effects of this classification errors on trajectory, we run STEMin on the remaining image pairs after removing the ones identified as wrong and computed STE on final trajectories obtained. The results are summarized in Table 3. Wrong edges were completely removed in the 99 trials out of 100. From the numbers in the table, it can be inferred that removing a small amount of correct edges did not have much impact, as opposed to not removing the wrong edge(s). This is also related with the dataset as it can be regarded as pretty “dense”, taking into account the total number of overlapping image pairs and images. Removing some correct edges did not have a notable effect. Our approach did not remove any correct edges on 45 trials out of 100, It removed only one correct edge on 32 trials, two edges on 9 trials, three edges on 10 trials, four edges on 2 trials, five and six edges on 1 trial.

Table 3 Error measures (in pixels) on final trajectories over 100 trials for the tested datasets

The fourth dataset (named as Underwater I) is a subset of an image sequence gathered by the \(ICTINEU^{AUV}\) (Ribas et al. 2007) underwater robot. The dataset is composed of 215 images (\(384\times 288\) pixels) and covers approximately \(400\,\mathrm{m}^2\). Time-consecutive images have overlap and the total number of registered overlapping image pairs number is 1297. This dataset is sparser and has low overlap between image pairs, when comparing to the Szeno Dataset. The confusion matrix is summarized for 100 random trials in Table 2 while the errors on the obtained trajectories are given in Table 3. Since the time-consecutive images have overlap in this dataset, we did not remove them even if our proposal suggested to do so. The approach proposed in this paper successfully removed the wrong edges in 92 out of 100 trials. The total number of wrong edges added for the remaining 8 cases (where our proposal was not able to identify all wrong edges) is 543, and 523 of them were successfully removed.

Our proposal relies on having cycles with and without wrong edges so that the different error histograms can be observed between correct and wrong edges. In other words, if a correct edge always involves cycles where there is at least one wrong edge, then this correct edge is most likely to be classified as wrong since its histogram will be very similar to those of wrong edges. This may not be a problem for dense datasets as the number of overlapping image pairs per image is relatively high (e.g., Szeno dataset, Fig. 7) and this increases the probability of having cycles without wrong edges. In the case of sparse datasets, this assumption of having cycles without wrong edges may not be realistic. In such cases, the use of additional information on image positions coming from different sensors could be useful. Increasing the total number of cycle bases could also be a solution, as it increases the probability of having cycles without wrong edges. However, the data characteristics may still not allow this assumption to be hold. For the Underwater I dataset, the total number of overlapping image pairs per image is illustrated in Fig. 8. As it can be seen, some images overlap with a reduced number of neighboring images. This may rule out the initial assumption depending on the randomly added wrong edges.

Fig. 8
figure 8

Overlapping image pair per image in the Underwater I dataset. Please note that overlapping image pairs are counted for both images (e.g., the overlapping pair between image a and b is counted for both images.). Some of the images have very few overlapping image pairs. This reduces the total number of cycles in which they could be involved. Depending on the wrong edges, these images might always be in the cycles along with wrong edges. This could lead them to be identified as wrong edges

The fifth dataset (named as Underwater II) was obtained using a digital camera carried by a Phantom XTL Unmanned Underwater Vehicle (UUV) during a survey of a patch of reef located in the Florida Reef Tract (Lirman et al. 2007). It consists of 486 images of \(512\,\times \,384\) pixels. The total number of overlapping image pairs is 3225. There are 6 time-consecutive image pairs that do not have any overlap. We kept the overlapping time-consecutive images even if they were identified as a wrong edge.

The confusion matrix is summarized for 100 random trials in Table 2 while the errors on the obtained trajectories are given in Table 3. The proposed method was not able to successfully identify all wrong edges in 5 trials out of 100. The total number of wrong edges added for those 5 trials were 898 and the proposed method was able to remove 890. From the confusion matrix, it can be seen that the total number of correct edges identified as wrong by our proposal can be considered as relatively high. This is mainly due to the way in which the classifier is trained. We opt to put high penalization for classifying wrong edges as correct since even one of them is usually enough to disturb the final result. To illustrate the performance of the method, the final mosaics and trajectories for one of the random trials (with the maximum STE error of 387.34) are illustrated in Fig. 9. In this case our proposal was not able to remove all wrong edges. The total number of overlapping image pairs per image is illustrated in Fig. 10.

Fig. 9
figure 9

[Top] Final Mosaics for Underwater II dataset rendered using last-on-top strategy. [Bottom] Final Trajectories: Solid line denotes the trajectory without any wrong edge while the dashed one shows the trajectory after removing wrong edges. A total number of 214 wrong edges were added and 212 of them were successfully removed. The remaining two (between images 201–168 and 202–131) of them has caused some degradation on the trajectory and mosaic, respectively

Fig. 10
figure 10

Overlapping image pair per image in the Underwater II Dataset. Please note that overlapping image pairs are counted for both images (e.g., the overlapping pair between image a and b is counted for both images)

4 Conclusions

Image registration is generally done by extracting and matching some salient points (features) in the images. Although some robust methods are employed, some mismatched image pairs can occur due to the repeated textures and high structural similarity in images. These mismatched pairs can cause significant degradation on the final result (e.g., trajectory, mosaic, 3D Reconstruction, etc.). In this paper, we propose a method to identify mismatched image pairs using close-loop constraints (cycles) in camera trajectory. When the motion between images in a closed-loop is accumulated, the combined motion should be close to identity.

Our proposal relies on generating different cycles and computing error histograms for each overlapping image pairs. Mismatched image pairs have different error histogram compared to the those of correct edges. To identify mismatched image pairs, we employ a machine learning technique operating on error histograms. We presented experimental results with both real and synthetic datasets under different scenarios. Our proposal was able to remove wrong edges successfully and also able to identify edges not fully mismatched but including some outliers. Our method can be embedded in any application that requires processing of image sequence.