1 Introduction

Multimodal images are two or more images captured by different types of imaging modalities, such as CT (computed tomography), MRI (magnetic resonance imaging), PET (positron emission tomography) and SPECT (single-photon emission computed tomography) [30]. Characteristics of multimodal images have been studied in a wide range of applications, especially in medical diagnosis [19, 28, 42]. For instance, both similarities and differences between modalities are considered in modeling the progression of chronic diseases [36, 37].

This paper focuses on multimodal image registration, which aims to align corresponding objects in multimodal images [54]. This operation is critical to many computer vision applications, e.g. medical image analysis [39, 43], remote sensing [7] and aerial imagery analysis [51]. Registering multimodal images is very challenging because there may be substantial intensity variations between corresponding parts of the images [25, 48].

1.1 Sample of challenging multimodal images

An example of multimodal images is given in Fig. 1a and b. The two images are captured by two types of microscopes: a standard light microscope and a confocal microscope. The standard light microscope is mounted with a color camera to capture the brightfield image, like the one shown in Fig. 1a. The specimen here has been stained with colored dyes (blue for the nuclei and brown for the vessels), so all the color information is contained in a single image; essentially this is a standard digital image. The confocal microscope uses lasers of different wavelengths to excite different fluorochromes. Figure 1b shows an image captured using a confocal microscope. In this image, a laser within a specific range of wavelengths is used to excite the fluorescein isothiocyanate (FITC) dye that labels the blood vessels. The light emitted from the FITC dye is collected by a sensor and recorded as a grey-scale image. The green in the image is a pseudo-color assigned to that dye based on its excitation wavelength. If multiple colors were required, multiple laser wavelength ranges would be used, each color being collected as a separate grey-scale image; an appropriate color is then assigned to each channel and the channels are merged into an RGB image. We use Fig. 1 for illustration because multimodal microscopic images remain the most challenging among our test multimodal images.
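To make the channel-merging step described above concrete, the following minimal Python sketch (our illustration, not the microscope software; the channel names and the green/blue pseudo-color assignment are assumptions) combines grey-scale fluorescence channels into a pseudo-colored RGB image:

import numpy as np

def merge_pseudo_color(channels, colors):
    """Merge grey-scale channels into one RGB image.

    channels: list of 2-D uint8 arrays (one per fluorochrome)
    colors:   list of RGB triplets in [0, 1], one per channel
              (e.g. green for the FITC channel)
    """
    h, w = channels[0].shape
    rgb = np.zeros((h, w, 3), dtype=np.float32)
    for ch, col in zip(channels, colors):
        # Each grey-scale channel contributes to the RGB image
        # with its assigned pseudo-color.
        rgb += ch[..., None].astype(np.float32) / 255.0 * np.asarray(col)
    return (np.clip(rgb, 0.0, 1.0) * 255).astype(np.uint8)

# Hypothetical usage: FITC channel rendered green, a second channel rendered blue.
fitc = np.random.randint(0, 256, (512, 512), dtype=np.uint8)
dapi = np.random.randint(0, 256, (512, 512), dtype=np.uint8)
merged = merge_pseudo_color([fitc, dapi], [(0, 1, 0), (0, 0, 1)])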

Fig. 1 An example of brightfield and confocal microscopic images. a: original brightfield image; b: original confocal image; c and d: images after being processed by our prior work (DSS: Detector of Structural Similarity) [31, 32] on (a) and (b) respectively

Obviously, there are very large content differences between Fig. 1a and b. Blue structures in the brightfield image do not appear in the confocal image. Moreover, some brown structures in the brightfield image cannot be clearly seen in the confocal image. To increase the structural similarity between Fig. 1a and b, the two images were pre-processed into Fig. 1c and d in our prior work (DSS: Detector of Structural Similarity) [31, 32]. Compared with Fig. 1a and b, corresponding image structures between Fig. 1c and d are much clearer. However, the content differences in images such as Fig. 1c and d remain large, in two respects. First, the pixels in the confocal image are all spatially close to each other, whereas many pixels in the brightfield image are unconnected. Second, the brightfield image presents much larger intensity variations than the confocal image.

It remains very challenging for existing feature-based multimodal image registration techniques such as [9, 10, 15, 40, 45] to effectively register images like Fig. 1c and d. These techniques are based on keypoints, which are sensitive to differences in image content such as intensities or gradients. Due to the large content differences between corresponding regions in images like Fig. 1c and d, the descriptors of corresponding keypoints are not close, no matter how discriminative the local descriptor itself is. This hinders corresponding keypoints from being matched. Consequently, the accuracy of keypoint matches is unlikely to be high, leading to poor registration performance.

1.2 Contributions of this paper

To effectively register complex multimodal images, we propose a novel multimodal image registration technique by borrowing the main idea of the multimodal image registration framework in [26] and exploring feature representations of corners. The paper focuses on addressing two issues. First, image content may differ greatly between corresponding parts of multimodal images such as Fig. 1c and d. Second, large scale differences may occur.

Our contributions in this paper are threefold. First, the proposed multimodal image registration technique is based on contour-based corners, which are independent of intensity and gradient changes in images. Second, a novel corner descriptor is proposed to represent edges in the neighborhood of corners. Third, we propose a simple yet effective way of estimating the scale difference between two images.

This paper is an extension of our prior work [33], with the following major improvements.

  i. Analyzing how a state-of-the-art multimodal image registration technique [26] performs in handling large content differences and large scale differences between complex multimodal images;

  ii. A more detailed and accurate description of our proposed technique;

  iii. Experiments on registering multimodal images with large scale differences (up to four times); and

  iv. Evaluating the proposed technique more extensively, including performance comparisons on non-microscopic and microscopic images separately, and a comparison with a benchmark intensity-based multimodal image registration technique [24].

The rest of the paper is structured as follows. Section 2 summarizes related multimodal image registration techniques. Sections 3 and 4 identify the limitations of a state-of-the-art multimodal image registration technique [26] in handling large content differences and large scale differences respectively. In Section 5, the proposed technique is presented, followed by a performance study in Section 6. The paper is concluded in Section 7.

2 Related work

Intensity-based image registration techniques such as [22, 24, 35, 38, 44] have gained popularity in registering multimodal images, especially medical images. An intensity-based image registration technique estimates an optimal transformation between the reference and target images by comparing their intensity patterns [43]. In particular, elastix [24] has been presented as a toolbox for intensity-based medical image registration. elastix integrates multiple choices in various modules, such as transformation models and similarity measures, which allows users to tailor the toolbox to a specific image registration application. Due to its popularity and effectiveness, elastix will be used in this paper as one of the benchmark intensity-based multimodal image registration techniques for performance comparisons.

A second category of multimodal image registration techniques is based on local features; our work is mainly focused on feature-based multimodal image registration. Among such local features, multimodal variants of SIFT are particularly popular, including SIFT-GM (GM: Gradient Mirroring) [23], Symmetric SIFT [9], IS-SIFT (IS: Improved Symmetric) [20, 21, 45], GO-IS-SIFT (GO: Gradient Occurrences) [20, 45], PIIFD (Partial Intensity Invariant Feature Descriptor) [10], UR-SIFT-PIIFD (UR: Uniform Robust) [15], NG-SIFT (NG: Normalized Gradients) [40] and HD-MOG-IS-SIFT (HD: Higher Discrimination, MOG: Magnitudes and Occurrences of Gradient) [34]. These multimodal local descriptors take into account certain characteristics of multimodal images. PIIFD [10] is herein selected as a representative of the aforementioned multimodal variants of SIFT. On the basis of building orientation histograms within a local region, as done in SIFT [29], PIIFD [10] has three main distinct properties. First, normalized gradient magnitudes are accumulated into the corresponding bins of an orientation histogram, thereby mitigating the effect of changes in gradient magnitude between corresponding image contents. Second, gradient orientations are constrained to [0, 180°), which addresses the issue that gradient orientations at corresponding locations of multimodal images may point in opposite directions; this issue was discussed and called gradient reversal in [45]. Third, to address the issue that the main orientations of corresponding keypoints may point in opposite directions (referred to as region reversal in [45]), a linear combination is performed on two intermediate descriptors built for a local region and its version rotated by 180°. PIIFD was improved by UR-SIFT-PIIFD [15] in terms of robustness to scale changes by enhancing the stability and distinctiveness of SIFT keypoints. However, UR-SIFT-PIIFD cannot effectively register multimodal images with large content differences since it still uses SIFT-like keypoints, which rely heavily on intensity changes. As shown in [25], UR-SIFT-PIIFD even performs worse than PIIFD in registering multimodal images with complex intensity changes. Based on our analysis, the aforementioned multimodal variants of SIFT only consider straightforward characteristics of multimodal images such as gradient reversal; however, the real situation may be more complex, as in registering the two images shown in Fig. 1c and d.

Moreover, there exist edge-based image registration techniques, such as ED-DB-ICP (Edge Driven Dual Bootstrap Iterative Closest Point) [47] and EOH (Edge Oriented Histogram) [12]. ED-DB-ICP [47] enriches SIFT with shape context using edge points, but it is not robust to scale changes and noise. EOH [12] detects keypoints as SIFT does and then builds descriptors using the proposed edge oriented histograms. However, EOH is not scale-invariant since it determines the region size empirically when building descriptors. EOH refines keypoint matches by performing a scale restriction process [53], but scale invariance still cannot be achieved; note that the estimated scale difference here refers to the scale factor attached to each keypoint detected by SIFT.

More recently, AB-SIFT (AB: Adaptive Binning) [41] and LoSPA (Low-dimensional Step Pattern Analysis) [25] have been proposed. AB-SIFT [41] mainly modifies SIFT in two aspects. First, the keypoint detection is improved so that keypoints are more robust to changes in scale and viewpoint. The second modification is the use of an adaptive histogram quantization strategy. AB-SIFT has shown advantages in registering remote sensing images; however, it has limitations in dealing with nonlinear or complex intensity differences between multimodal images, as pointed out in [41]. To effectively register multimodal retinal images with complex intensity changes, LoSPA [25] focuses on intensity change patterns, and 28 such patterns are empirically presented. In registering multimodal retinal images, LoSPA outperforms ED-DB-ICP, UR-SIFT-PIIFD and PIIFD (of which PIIFD performs best). However, LoSPA is not scale-invariant, and its registration performance is very poor when the scale difference is above 1.9 times, as reported in [25]. This is because LoSPA has no mechanism for achieving scale invariance: in building descriptors, it determines the region size empirically.

In [26], a multimodal image registration framework was proposed by making use of spatial and geometrical relationships of keypoint triplets. Its main idea is summarized as follows.

  i. Local descriptors are built. Relative to each keypoint in the reference image, all keypoints in the target image are ranked in terms of the descriptor distance to the reference keypoint. By doing so, an initial mapping for each reference keypoint is obtained.

  ii. Keypoint triplets are generated in the reference and target images.

  iii. For each reference keypoint, its best match is determined. This is achieved by iteratively searching and comparing all related pairs of keypoint triplets. To evaluate the transformation calculated from a pair of keypoint triplets, the similarity metric is defined as the Number of Overlapped Pixels (NOP) between the edges of the two entire images, which allows Global Information (GI) to be incorporated (a sketch of the NOP computation is given after this list).

  iv. All keypoint matches are ranked by their NOP values. A threshold is set to select the keypoint matches with the highest NOP values.

  v. RANSAC [14] is used to refine keypoint matches.

  vi. A transformation is estimated from the refined keypoint matches and is used for aligning the reference and target images.
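To make the NOP similarity metric in Step iii concrete, the following minimal Python sketch (our illustration, not the implementation of [26]; OpenCV is used for the affine warp, and nearest-neighbor interpolation of the edge map is an assumption) computes the NOP for one candidate triplet pair:

import cv2
import numpy as np

def nop_for_triplet_pair(ref_edges, tgt_edges, ref_triplet, tgt_triplet):
    """Number of Overlapped Pixels (NOP) between the edge maps of two
    whole images, under the transformation implied by one triplet pair.

    ref_edges, tgt_edges: binary edge maps (e.g. from cv2.Canny)
    ref_triplet, tgt_triplet: (3, 2) arrays of matched keypoint coordinates
    """
    # Affine transformation mapping the target triplet onto the reference triplet.
    M = cv2.getAffineTransform(np.float32(tgt_triplet), np.float32(ref_triplet))
    h, w = ref_edges.shape
    # Warp the target edge map into the reference frame and count overlapping edge pixels.
    warped = cv2.warpAffine(tgt_edges, M, (w, h), flags=cv2.INTER_NEAREST)
    return int(np.count_nonzero((ref_edges > 0) & (warped > 0)))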

In [26], SIFT [29] and PIIFD [10] are used as local descriptors. Accordingly, the resulting multimodal image registration techniques are called GI-SIFT and GI-PIIFD respectively in this paper. Theoretically, the multimodal image registration framework in [26] should work with any other local descriptor. Based on our analysis, the main problem of [26] lies in the lack of discriminative feature representations and accurate scale estimation. The local descriptor, whether SIFT or PIIFD, is invariant neither to large content differences nor to large scale differences. Some may argue that this problem could be addressed by a more competitive local descriptor. To the best of our knowledge, however, there exists no local descriptor so far that can adequately deal with both large content differences and large scale differences when registering multimodal images.

It has been shown in [26] that GI-SIFT and GI-PIIFD significantly improve SIFT and PIIFD respectively. This validates the effectiveness of the multimodal image registration framework. We assume that the registration performance would be further enhanced by exploring a more robust feature representation and accurately estimating the scale difference between images. Moreover, GI-PIIFD outperforms GI-SIFT when registering multimodal images, as reported in [26]. Thus, GI-PIIFD will be used as a benchmark technique for performance comparisons in this paper. In Sections 3 and 4, we will analyze how GI-PIIFD performs in handling large content differences and large scale differences respectively.

3 Large content differences between complex multimodal images

It is challenging to register multimodal images with large content differences when using the PIIFD descriptor. In PIIFD, keypoints are detected using the Harris corner detector which relies on intensity variations in a small neighborhood [17]. The PIIFD descriptor is built based on a local region around each keypoint. In each local region, normalized gradient magnitudes are used to build orientation histograms. Due to the use of gradient information, the PIIFD descriptor is sensitive to content differences within the local region. Figure 2 illustrates such an example of large content differences between corresponding parts of Fig. 1c and d.

Fig. 2 Illustrating large content differences between complex multimodal images. The two image patches are corresponding parts of Fig. 1c and d. A red dot represents a keypoint detected by PIIFD [10]. A PIIFD descriptor is built in a local region as enclosed by a green square

By comparison, we observe that curvatures of corners are relatively more robust to content differences. The Fast-CPDA corner detector [2, 4] is used in this work. It estimates the curvatures of contour points using the chord-to-point distance accumulation technique [16], and contour points that are local maxima of curvature are treated as candidate corners. Thus, the curvature of a Fast-CPDA corner is independent of intensity and gradient changes in the neighborhood of the corner.
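To give an idea of how chord-to-point distance accumulation works, the following is a minimal sketch only (our simplification; the chord length, the handling of contour end-points and the multi-chord scheme and normalization of Fast-CPDA [2, 4] are assumptions or omissions):

import numpy as np

def cpda_curvature(contour, chord_len=10):
    """Chord-to-point distance accumulation along a contour.

    contour: (N, 2) array of (x, y) contour points
    Returns an (N,) array; larger accumulated values indicate sharper bends.
    """
    n = len(contour)
    acc = np.zeros(n)
    for i in range(n):
        # Slide a chord of fixed length so that it spans point i,
        # and accumulate the distance from point i to each chord placement.
        for k in range(1, chord_len):
            a = contour[max(i - k, 0)]
            b = contour[min(i + (chord_len - k), n - 1)]
            chord = b - a
            norm = np.linalg.norm(chord)
            if norm > 0:
                d = contour[i] - a
                # Perpendicular distance from contour[i] to the line through a and b.
                acc[i] += abs(chord[0] * d[1] - chord[1] * d[0]) / norm
    return acc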

Figure 3 shows a pair of corresponding corners detected by the Fast-CPDA corner detector. Note that the local regions highlighted in Figs. 2 and 3 are the same. Based on the curvature estimation in the Fast-CPDA corner detector, the curvatures of the two corners in Fig. 3a and b are very similar, despite the large content differences between the two regions. Hence, curvatures of Fast-CPDA corners are more robust to content differences than PIIFD descriptors.

Fig. 3 Illustrating curvature similarity of corresponding corners. A red dot represents a corner detected by the Fast-CPDA corner detector [2, 4]. a and b correspond to Fig. 2a and b respectively. The local region enclosed by a dashed square is the same as in Fig. 2

4 Scale invariance

Scale invariance is discussed in this section. First, we analyze the significance of scale invariance to image registration. Next, we illustrate that the PIIFD descriptor is not invariant to scale differences and discuss the impact of this on GI-PIIFD.

4.1 Significance of scale invariance to image registration

It is important to achieve scale invariance in registering images, as the reference and target images may contain structures at different scales [27]. For a feature-based image registration technique such as [9, 10, 45], a scale is estimated and assigned to each keypoint in a scale-space representation [27]. The scale of a keypoint determines the size of the local region in which a descriptor is built. Thus, the accuracy of the scale estimation directly affects the feature description and matching performance. If the estimated scale is inaccurate, the distance between the descriptors of a pair of corresponding keypoints is likely to be larger than it should be. Consequently, there is a high possibility that this potentially true match is rejected in the matching stage. Due to an inaccurate scale estimation, the final registration performance is therefore likely to be undermined.

4.2 Scale variance of PIIFD descriptor

The PIIFD descriptor was proposed in [10] for registering multimodal retinal images. The size of the local region for building a PIIFD descriptor is fixed at 40 × 40 pixels because there is only a minor scale difference between the retinal images tested in [10]. Using the same setting as [10] for the size of local regions, Fig. 4 illustrates corresponding keypoints manually extracted from brightfield and confocal images. Figure 4c is at a scale three times larger than Fig. 4a and b. As a result, the local regions in Fig. 4a and c only partially correspond, and accordingly the image structures represented when building the PIIFD descriptors are not equivalent.

Fig. 4 A visual comparison of local regions for building PIIFD descriptors at different scales. A red dot in each sub-figure represents a PIIFD keypoint. Images in (a) and (b) are at similar scales. The scale difference between (c) and (b) is three times. In (c), the local region in the blue square is used for building the PIIFD descriptor, and the small region within the green square corresponds to the regions in (a) and (b)

We now explain how GI-PIIFD [26] is affected by the scale variance of the PIIFD descriptor, since GI-PIIFD will be used as the benchmark multimodal image registration technique for evaluating our proposed technique in this paper. GI-PIIFD determines the initial mappings of keypoints by selecting a set of closest descriptors, followed by matching triplets of keypoints. Due to the scale variance of the PIIFD descriptor, the number of correspondences appearing in the initial mappings is likely to decrease as the scale difference between the reference and target images increases. Figure 5 gives two examples of the correspondences appearing in the initial mappings of GI-PIIFD, when registering images with similar scales and with a scale difference of three times respectively. There are 33 of 58 correspondences in registering Fig. 5a and b, whereas there are only two of 21 correspondences in registering Fig. 5c and d; the latter number (58 or 21) denotes the total number of correspondences between the PIIFD keypoints detected in the reference and target images. Obviously, when registering Fig. 5c and d there is no chance of matching a triplet pair in which all three keypoint pairs are correspondences. Consequently, it is impossible to effectively register the two images.

Fig. 5 Illustrating how GI-PIIFD is affected by the scale variance of the PIIFD descriptor. Red dots indicate correspondences which appear in initial mappings of keypoints using GI-PIIFD. (a) and (b) are at similar scales; the scale difference between (c) and (d) is three times. Here (c) and (d) are shown at similar sizes only for the purpose of a clear illustration. The two blue arrows in (c) and (d) point to two keypoints. In registering (a) and (b), 33 correspondences appear in initial mappings of keypoints, but only two correspondences in registering (c) and (d)

Sections 3 and 4 have shown that PIIFD performs poorly in dealing with large content differences and large scale differences respectively. Admittedly, some other local descriptor may perform better than PIIFD in dealing with large content differences or large scale differences. One example is that LoSPA [25] may be, to some extent, more robust than PIIFD in handling large content differences. However, LoSPA is not sufficiently robust to scale changes in multimodal images, as stated in Section 2. Generally, our analysis shows that any existing intensity-based or gradient-based local descriptor is unlikely to be effective in registering multimodal images with large differences in both content and scale. Thus, we propose a multimodal image registration technique based on corners.

5 Proposed technique

This section elaborates our proposed COREG. An overview of COREG is first given, followed by detailed descriptions of a few key issues.

5.1 Overview of COREG

COREG is designed based on the registration framework in [26]. GI-PIIFD [26] has limitations in handling large content differences and large scale differences when registering multimodal images, as stated in Sections 3 and 4. Overall, our aim is to achieve greater robustness to large differences in image content and scale compared with GI-PIIFD [26]. To achieve greater robustness to large content differences, we explore curvature similarity between corners and propose a novel corner descriptor, which will be elaborated in Sections 5.2 and 5.4. To deal with large scale differences, a novel way of scale estimation is proposed by taking into account geometric relationships between corner triplets, which will be discussed in Section 5.3.

The steps in COREG are as follows.

  i. Detecting corners. Corners are detected in the reference and target images using the Fast-CPDA corner detector [2, 4].

  ii. Determining initial mappings of corners using curvature similarities. Relative to each reference corner, the curvature similarities of all the corners in the target image are ranked. By selecting highly-ranked corners, the candidate matches of each reference corner are determined. Curvature similarity will be described in Section 5.2.

  iii. First round matching of corner triplets. With the initial mappings of corners determined in Step ii, all the possible mappings of corner triplets are generated. Each pair of corner triplets in the reference and target images is compared and accordingly a transformation is computed. The transformation is used to transform the target image onto the reference image; the corresponding edge images are overlapped and the Number of Overlapped Pixels (NOP) is computed. By comparing NOP values, the pair of corner triplets with the maximum NOP is selected. The selected triplet pair is denoted as \(TP_1\).

  iv. Estimating the scale difference between the reference and target images. The scale difference between the reference and target images is estimated from the corner triplet pair \(TP_1\). The estimated scale difference is obtained by averaging the length ratios between corresponding line segments in the two corner triplets. This will be illustrated in Section 5.3.

  v. Second round matching of corner triplets. First, the reference and target images are resized using the scale difference estimated in Step iv. Second, a novel local descriptor called Distribution of Edge Pixels Along Contour (DEPAC) is built for each corner; the proposed DEPAC descriptor will be described in Section 5.4. Similar to Step ii, the initial mappings of corners can be determined by ranking the DEPAC descriptor distances. Next, the matching of corner triplets is carried out based on curvature similarity and the DEPAC descriptor respectively, and accordingly two pairs of corner triplets are obtained. The pair with the higher NOP is denoted as \(TP_2\).

  vi. Determining a triplet pair. The two triplet pairs, \(TP_1\) and \(TP_2\), are compared in terms of NOP, and the one with the higher NOP is selected. The selected triplet pair is denoted as \(TP_s\).

  vii. Refining the localizations of the selected corner triplet pair \(TP_s\). With the triplet pair determined, the localizations of the corner pairs in the triplet pair are refined within a small neighborhood. If a higher NOP can be achieved, the triplet pair is updated with the refined corner localizations. This will be discussed in Section 5.5.

  viii. Estimating a transformation and aligning images. A transformation is estimated from the selected corner triplet pair \(TP_s\) and is finally used for aligning the reference and target images.

Table 1 compares the steps in COREG and GI-PIIFD [26], which clearly indicates the differences between the two techniques. Compared with GI-PIIFD, the novelties of COREG lie in Steps ii, iv, v and vii. For Steps ii and v, we will describe curvature similarity between corners in Section 5.2 and the DEPAC descriptor in Section 5.4. Steps iv and vii will be elaborated in Sections 5.3 and 5.5 respectively.

Table 1 Comparing steps in COREG and GI-PIIFD

5.2 Curvature similarity between corners

Let us first define corners in the reference and target images as

$$ C_{r}=\left\{ {C^{1}_{r}}, {C^{2}_{r}}, \ldots\, C^{N_{r}}_{r} \right\}, \\ $$
(1)

and

$$ C_{t}=\left\{ {C^{1}_{t}}, {C^{2}_{t}}, \ldots\, C^{N_{t}}_{t} \right\}, \\ $$
(2)

where \(N_r\) and \(N_t\) denote the numbers of corners in the reference and target images respectively. Likewise, the curvatures of the corners are defined as

$$ K_{r}=\left\{ {K^{1}_{r}}, {K^{2}_{r}}, \ldots\, K^{N_{r}}_{r} \right\}, \\ $$
(3)

and

$$ K_{t}=\left\{ {K^{1}_{t}}, {K^{2}_{t}}, \ldots\, K^{N_{t}}_{t} \right\}. \\ $$
(4)

Given two corners from the reference and target images, their curvature similarity is defined as

$$ s^{ij}=\frac{\left|{K^{i}_{r}}-{K^{j}_{t}}\right|}{{K^{i}_{r}}}, \\ $$
(5)

where \(1 \leq i \leq N_r\) and \(1 \leq j \leq N_t\). Explicitly, the smaller the value of \(s^{ij}\), the higher the curvature similarity between the two corners.

With the curvature similarity defined in (5), all the corners in the target image are ranked by their curvature similarities relative to each reference corner. The highly-ranked corners comprise candidate matches. Thus, a reference corner is mapped to these candidate matches as

$$ {C^{i}_{r}} \mapsto \left\{ {C^{1}_{t}}, {C^{2}_{t}}, \ldots\, C^{N_{c}}_{t} \right\}, \\ $$
(6)

where \(N_c\) represents the number of candidate matches. Given three corners \({C^{i}_{r}}\), \({C^{j}_{r}}\) and \({C^{k}_{r}}\) in the reference image, a corner triplet is generated. With the candidate matches of each reference corner given by (6), all the possible corner triplets are generated in the target image.
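A minimal sketch of this curvature-based candidate selection, corresponding to (5) and (6), is given below (the curvature arrays are assumed to come from the corner detector, and the number of candidates \(N_c\) is a free parameter):

import numpy as np

def candidate_matches_by_curvature(K_r, K_t, n_candidates=5):
    """For each reference corner, rank target corners by the curvature
    similarity of (5) and keep the top candidates, as in (6).

    K_r: (N_r,) curvatures of reference corners
    K_t: (N_t,) curvatures of target corners
    Returns a list of index arrays, one per reference corner.
    """
    candidates = []
    for k_ref in K_r:
        s = np.abs(k_ref - np.asarray(K_t)) / k_ref   # smaller = more similar
        candidates.append(np.argsort(s)[:n_candidates])
    return candidates

# Hypothetical usage with toy curvature values.
K_r = np.array([0.32, 0.18, 0.45])
K_t = np.array([0.30, 0.52, 0.19, 0.44, 0.07])
print(candidate_matches_by_curvature(K_r, K_t, n_candidates=2))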

5.3 Scale estimation

As stated in Step iii of COREG in Section 5.1, a pair of corner triplets, \(TP_1\), is selected after the first round matching of corner triplets. Our way of estimating the scale difference is based on the triplet pair \(TP_1\). Figure 6 shows \(TP_1\) when registering a pair of brightfield and confocal images. The three corners \({C^{i}_{t}}\), \({C^{j}_{t}}\) and \({C^{k}_{t}}\) in the brightfield image correspond to the three corners \({C^{i}_{r}}\), \({C^{j}_{r}}\) and \({C^{k}_{r}}\) in the confocal image. With the three corner pairs, the scale difference between the two images is estimated by averaging the length ratios between corresponding line segments in the two corner triplets, i.e.,

$$ \begin{array}{llll} \sigma = \frac{1}{3} \times \left( \frac{\left|\overrightarrow{{C^{i}_{r}} {C^{j}_{r}}}\right|}{\left|\overrightarrow{{C^{i}_{t}} {C^{j}_{t}}}\right|} + \frac{\left|\overrightarrow{{C^{j}_{r}} {C^{k}_{r}}}\right|}{\left|\overrightarrow{{C^{j}_{t}} {C^{k}_{t}}}\right|} + \frac{\left|\overrightarrow{{C^{k}_{r}} {C^{i}_{r}}}\right|}{\left|\overrightarrow{{C^{k}_{t}} {C^{i}_{t}}}\right|}\right). \end{array} $$
(7)
Fig. 6 An example of a triplet pair for estimating the scale difference. The actual scale difference between the two images is 1:2.73 and the estimated scale difference is 1:2.82. Here the two images are displayed at similar scales so that readers can find correspondences easily

In the example shown in Fig. 6, the ground-truth scale difference between the brightfield and confocal images is 1:2.73, whereas the estimated scale difference is 1:2.82. We can see the estimated scale difference is quite close to the ground-truth one. The accuracy of scale estimation for all the test image pairs will be illustrated in Section 6.3.

Note that the accuracy of scale estimation largely depends on the correctness of the triplet pair \(TP_1\). As stated in Section 5.1, this triplet pair leads to the maximum NOP, indicating a very high similarity between the edges of the two images. Thus, there is a very high likelihood that this triplet pair is correct for estimating the scale difference. In case the triplet pair \(TP_1\) is incorrect, we suggest the following for obtaining a suitable triplet pair. First, a different edge detector [13, 50] can be used when calculating NOP; our observation is that the accuracy of NOP is directly affected by the quality of the edge detector used. Second, instead of only using the curvatures of corners (Step ii in Section 5.1), both curvatures and our proposed DEPAC descriptor can be used to determine the initial mappings of corners. A more accurate feature description should improve the quality of the initial mappings of corners.
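A minimal sketch of the scale estimation in (7), assuming the two matched triplets are given as 3 × 2 coordinate arrays with corners listed in corresponding order:

import numpy as np

def estimate_scale(ref_triplet, tgt_triplet):
    """Average the length ratios of corresponding triangle sides, as in (7).

    ref_triplet, tgt_triplet: (3, 2) arrays of corner coordinates
    Returns sigma such that reference lengths are approximately sigma times
    the corresponding target lengths.
    """
    ratios = []
    for a, b in ((0, 1), (1, 2), (2, 0)):
        len_r = np.linalg.norm(ref_triplet[a] - ref_triplet[b])
        len_t = np.linalg.norm(tgt_triplet[a] - tgt_triplet[b])
        ratios.append(len_r / len_t)
    return float(np.mean(ratios))

# Toy example: the target triangle is twice as large as the reference one,
# so the estimated ratio of reference to target lengths is 0.5.
ref = np.array([[10, 10], [60, 10], [10, 50]], dtype=float)
tgt = np.array([[20, 20], [120, 20], [20, 100]], dtype=float)
print(estimate_scale(ref, tgt))  # 0.5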

5.4 DEPAC: our proposed corner descriptor

Curvature [1,2,3,4, 18, 46] is an important representation of corners. The curvature of a corner describes how the edge pixels move along the contour of the corner in a small neighborhood. To better represent corners, we propose a novel corner descriptor. Firstly, an example is given to illustrate the limitations of representing corners using only their curvatures. Figure 7a and b show two corners and their contours, extracted respectively from a reference image and its target image in our test image pairs. The two corners are not corresponding in terms of ground-truth locations. The curvatures of the two corners are very close because the edges in a small neighborhood are structurally very similar; however, the edge structures in a larger neighborhood differ significantly. Based on this analysis, a novel corner descriptor is proposed to capture more edge information surrounding a corner than its curvature does. Note that only the edge pixels along the contour on which the corner is located are represented in the proposed corner descriptor, because the number of edges may differ greatly between corresponding parts of multimodal images. Thus, the proposed corner descriptor is called Distribution of Edge Pixels Along Contour (DEPAC).

Fig. 7 Building the DEPAC descriptor. A green dot denotes a corner. a and b show the corner and the contour on which the corner is located, in the reference and target image respectively. In c and d, the neighborhood around a contour is divided into 16 sub-regions as the red circles and blue lines indicate. The arrow in (c) and (d) points to the main orientation. Note that only the semicircle which contains the contour is used for building the DEPAC descriptor

Let \({C^{i}_{r}}\), \({C^{j}_{t}}\), \(\Gamma \left ({C^{i}_{r}}\right )\) and \(\Gamma \left ({C^{j}_{t}}\right )\) denote the two corners and their contours shown in Fig. 7a and b. We illustrate how a DEPAC corner descriptor is built using \({C^{i}_{r}}\) and \(\Gamma \left ({C^{i}_{r}}\right )\) as follows.

  i. Concentric circles are plotted by taking the corner as the center, as shown in Fig. 7c. Let \(R\) denote the radius of the innermost circle. The radius of each subsequent concentric circle is incremented by \(R\), from inside to outside. In our implementation, \(R\) is set to five pixels.

  ii. The main orientation of the corner, \(O_m\), is defined by averaging the orientations of the two tangents [18]. In Fig. 7c, the arrow points to the main orientation.

  iii. Orientation bins are defined on the two sides of the main orientation. As plotted by blue lines in Fig. 7c, the four quantized orientations are \(O_1 = O_m - 90^{\circ}\), \(O_2 = O_m - 45^{\circ}\), \(O_3 = O_m\) and \(O_4 = O_m + 45^{\circ}\) in an anticlockwise direction. With four concentric circles and four quantized orientations, 16 sub-regions are defined in the neighborhood of the corner, each denoted as \((c,o)\), where \(1 \leq c \leq 4\) and \(1 \leq o \leq 4\).

  iv. In the sub-region \((c,o)\), the number of edge pixels is incremented by one if an edge pixel \(P_e\) along the contour falls into this sub-region, i.e.

$$ (c-1) \times R < d\left( P_{e},{C^{i}_{r}}\right)\leq c \times R, \\ $$
(8)

and

$$ O_{o} \leq \angle\left(\overrightarrow{{C^{i}_{r}}P_{e}}\right) < O_{o+1}, \\ $$
(9)

where \(d\left(P_{e},{C^{i}_{r}}\right)\) is the Euclidean distance between \(P_e\) and \({C^{i}_{r}}\), \(1 \leq c \leq 4\) and \(1 \leq o \leq 4\). Equations (8) and (9) represent the distance and orientation conditions that \(P_e\) should satisfy. The number of edge pixels counted for the sub-region \((c,o)\) is denoted as \(NEP_{c,o}\). For the corner \({C^{i}_{r}}\) shown in Fig. 7c, the number of edge pixels in each sub-region is listed in Table 2.

  v. The number of edge pixels in each sub-region, \(NEP_{c,o}\), is normalized into [0,1] by

$$ NEP_{c,o} = \frac{NEP_{c,o}}{\max\{NEP_{c,o}\}}. \\ $$
(10)

With the normalized \(NEP_{c,o}\), the DEPAC descriptor is built. (A code sketch of the full construction is given at the end of this section.)

Table 2 Number of edge pixels in each sub-region for corner \({C^{i}_{r}}\)

To compare the DEPAC descriptors built for the two corners, \({C^{i}_{r}}\) and \({C^{j}_{t}}\), the number of edge pixels in each sub-region for \({C^{j}_{t}}\) is listed in Table 3. We can clearly see that the two DEPAC descriptors are very different. Thus, our proposed DEPAC descriptor captures important edge information in the neighborhood of a corner.

Table 3 Number of edge pixels in each sub-region for corner \({C^{j}_{t}}\)

It should be noted that scale invariance must be ensured in building DEPAC descriptors for corners in the reference and target images. Ideally, the size of concentric circles for building DEPAC descriptors should be in line with the actual scale difference between the reference and target images. To achieve scale invariance, the estimated scale difference σ, which has been discussed in Section 5.3, is used as

$$ R_{r} = \sigma \times R_{t}, $$
(11)

where \(R_r\) and \(R_t\) denote the radii of the innermost circle for building DEPAC descriptors in the reference and target images, respectively.
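Putting the steps above together, the following is a minimal sketch of building one DEPAC descriptor (our illustration; the corner location, main orientation and contour edge pixels are assumed to be available from the corner detector, the inner radius follows the five-pixel setting above, and the upper bound \(O_5 = O_m + 90^{\circ}\) used to close the last orientation bin is an assumption):

import numpy as np

def depac_descriptor(corner, main_orientation, contour_pixels, radius=5.0):
    """Distribution of Edge Pixels Along Contour (sketch).

    corner:           (x, y) corner location
    main_orientation: main orientation O_m of the corner, in radians
    contour_pixels:   (N, 2) array of edge pixels on the corner's contour
    radius:           radius R of the innermost circle (five pixels in the paper)

    Returns a 16-dimensional descriptor (4 rings x 4 orientation bins),
    normalized into [0, 1] as in (10).
    """
    hist = np.zeros((4, 4))
    for p in np.asarray(contour_pixels, dtype=float):
        d = p - np.asarray(corner, dtype=float)
        dist = np.hypot(d[0], d[1])
        if dist == 0 or dist > 4 * radius:
            continue  # the corner itself, or outside the outermost circle
        ring = min(int(np.ceil(dist / radius)) - 1, 3)        # condition (8)
        # Orientation of the vector corner -> pixel, measured relative to
        # the first quantized orientation O_1 = O_m - 90 degrees.
        angle = (np.arctan2(d[1], d[0]) - (main_orientation - np.pi / 2)) % (2 * np.pi)
        if angle >= np.pi:
            continue  # only the semicircle containing the contour is used
        obin = min(int(angle // (np.pi / 4)), 3)               # condition (9)
        hist[ring, obin] += 1
    if hist.max() > 0:
        hist = hist / hist.max()                               # normalization (10)
    return hist.ravel()

# Toy usage with a short synthetic contour around a corner at the origin.
contour = np.array([[r * np.cos(0.3), r * np.sin(0.3)] for r in range(1, 20)])
print(depac_descriptor((0.0, 0.0), main_orientation=0.3, contour_pixels=contour))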

5.5 Refining localizations

As stated in Section 5.1, a triplet pair \(TP_s\) is selected from \(TP_1\) and \(TP_2\) as the one with the higher NOP value. Let \(\{{C^{i}_{r}}, {C^{j}_{r}}, {C^{k}_{r}}\} \mapsto \{{C^{i}_{t}}, {C^{j}_{t}}, {C^{k}_{t}}\}\) denote \(TP_s\); thus \({C^{i}_{r}} \mapsto {C^{i}_{t}}\) is a match of corners. Based on our analysis, the two corners of a match in this triplet pair might not correspond accurately. As shown in Fig. 8, there may exist an image pixel \({C^{x}_{t}}\) in a small neighborhood of the corner \({C^{i}_{t}}\) such that \({C^{i}_{r}} \mapsto {C^{x}_{t}}\) is more accurate than \({C^{i}_{r}} \mapsto {C^{i}_{t}}\). This phenomenon is very likely to occur in multimodal images due to localization errors in detecting corners. Such errors can be caused by different amounts of noise in corresponding parts of multimodal images.

Fig. 8 Refining localizations. \({C^{i}_{r}} \mapsto {C^{i}_{t}}\) is a match of corners in a triplet pair \(TP_s\). \({C^{x}_{t}}\) denotes a corner within the small neighborhood of \({C^{i}_{t}}\). \({C^{i}_{r}} \mapsto {C^{x}_{t}}\) is actually more accurate than \({C^{i}_{r}} \mapsto {C^{i}_{t}}\)

The refinement of localizations is carried out by searching image pixels in a \(w \times w\) window, where \(w\) is the width of the search window. Note that the search is only performed in the target image, while the corner localizations of the triplet in the reference image remain unchanged. As a search window is set for each corner of the triplet in the target image, \((w \times w)^3 = w^6\) triplet pairs are additionally generated. If any of these \(w^6\) triplet pairs achieves a higher NOP, the triplet pair \(\{{C^{i}_{r}}, {C^{j}_{r}}, {C^{k}_{r}}\} \mapsto \{{C^{i}_{t}}, {C^{j}_{t}}, {C^{k}_{t}}\}\) is updated accordingly. In our experiments, \(w\) is set to five.
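A minimal sketch of this refinement step, assuming a callable `nop_fn` that returns the NOP of a candidate triplet pair (e.g. along the lines of the sketch in Section 2); the exhaustive search over all \(w^6\) offset combinations follows the description above:

import itertools
import numpy as np

def refine_target_triplet(ref_triplet, tgt_triplet, nop_fn, w=5):
    """Search a w x w window around each target corner and keep the
    combination of refined positions that maximizes NOP.

    ref_triplet: (3, 2) reference corner coordinates (kept fixed)
    tgt_triplet: (3, 2) target corner coordinates to be refined
    nop_fn:      callable(ref_triplet, tgt_triplet) -> NOP value
    """
    half = w // 2
    offsets = [np.array([dx, dy])
               for dx in range(-half, half + 1)
               for dy in range(-half, half + 1)]
    best_tgt, best_nop = tgt_triplet, nop_fn(ref_triplet, tgt_triplet)
    # (w*w)^3 = w^6 candidate triplets in total.
    for o1, o2, o3 in itertools.product(offsets, repeat=3):
        cand = np.vstack([tgt_triplet[0] + o1,
                          tgt_triplet[1] + o2,
                          tgt_triplet[2] + o3])
        score = nop_fn(ref_triplet, cand)
        if score > best_nop:
            best_tgt, best_nop = cand, score
    return best_tgt, best_nop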

5.6 A special consideration

In COREG, spatial relationships between corners are exploited by making use of corner triplets. If the number of corners is smaller than three, no corner triplet can be generated and the registration process terminates. Thus, special consideration must be given to ensuring there are sufficient corners for generating at least one corner triplet. In the Fast-CPDA corner detector [2, 4], edges are detected using the Canny edge detector [8], in which a high threshold and a low threshold define strong and weak edge pixels respectively. In COREG, when fewer than three corners are detected with the default threshold, the high threshold of the Canny edge detector is empirically lowered to preserve more edges so that more corners can potentially be detected.
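A minimal illustration of this fallback is given below (the corner detector is treated as a black-box callable, and the mapping of the normalized high threshold onto OpenCV's gradient range is an assumption; the normalized values quoted in Section 6 correspond to Matlab-style thresholds):

import cv2

def detect_corners_with_fallback(gray, corner_detector,
                                 high_thresholds=(0.35, 0.25),
                                 min_corners=3):
    """Retry corner detection with a lower Canny high threshold when
    fewer than three corners are found (Section 5.6).

    gray:            8-bit grey-scale image
    corner_detector: callable(edge_map) -> list of corners
                     (e.g. a Fast-CPDA implementation; treated as a black box)
    high_thresholds: normalized high thresholds to try, from default to lower
    """
    corners = []
    for t in high_thresholds:
        # Map the normalized threshold onto OpenCV's 0-255 gradient range
        # (an assumption made for this sketch).
        high = int(t * 255)
        edges = cv2.Canny(gray, int(0.4 * high), high)
        corners = corner_detector(edges)
        if len(corners) >= min_corners:
            break
    return corners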

6 Performance study

To evaluate the proposed COREG, the following evaluations are carried out. First, we measure the accuracy of the proposed way of estimating scale differences (Section 6.3). Then, the registration performance is extensively evaluated (Sections 6.4.1 and 6.4.2). Finally, an efficiency analysis is given (Section 6.4.3).

6.1 Test data

Five multimodal datasets were tested in our experiments. Dataset 1 includes two artificial pairs in which image contrast is reversed between the reference and target images. Dataset 2 includes 18 NIR (Near Infra-Red) vs EO (Electro-Optical) image pairs. Dataset 3 includes four image pairs used in [52]: three MRI pairs and one EO vs IR (Infra-Red) pair. The three MRI pairs are of different weighting patterns: T1 vs T2, T1 vs PD (Proton Density), and T2 vs PD. Dataset 4 includes 16 brightfield and confocal microscopic image pairs such as Fig. 1c and d. Dataset 5 includes 81 brightfield and fluorescence microscopic image pairs. Figure 9 shows sample image pairs for Datasets 1, 2, 3 and 5; a sample image pair of Dataset 4 has been shown in Fig. 1a and b.

Fig. 9 Examples of our test multimodal image pairs. a and b: artificial; c and d: NIR vs EO; e and f: MRI (T1 vs T2); g and h: EO vs IR; i and j: brightfield and fluorescence microscopic. The scale difference between (i) and (j) is 1X vs 4X

Datasets 1 to 4 include 40 image pairs, which we call the base image pairs. In these pairs, the scale difference between the reference and target images varies from 1:0.70 to 1:1.07. From these base image pairs, we have manually generated corresponding image pairs with scale differences of 1.5, 2, 3 and 4 times, respectively. Thus, five scale differences are tested. For reference, these five scale differences are labeled 1X vs 1X, 1X vs 1.5X, 1X vs 2X, 1X vs 3X and 1X vs 4X, where X indicates the relative scale (e.g. 1X vs 2X denotes a scale difference of two times). Different from Datasets 1 to 4, the scale difference in each image pair of Dataset 5 is real rather than manually generated. Dataset 5 contains 27, 36 and 18 image pairs with 1X vs 1X, 1X vs 2X and 1X vs 4X scale differences respectively.

6.2 Evaluation metric

To carry out quantitative performance comparisons, average registration error [49] is used to measure the overlap error after aligning the reference and target images with the estimated transformation. Average registration error (called ARE in this paper) is defined as

$$ ARE = \frac{1}{H \times W}\sum\limits_{x=1}^{W}\sum\limits_{y=1}^{H}\|T_{e}(x,y)-T_{g}(x,y)\|, $$
(12)

where H and W are the height and width of the reference image, T g is the ground-truth transformation and T e is the estimated transformation. The smaller the ARE value is, the better the registration performance will be.

The ground-truth transformations for Datasets 1, 2 and 3 are known or provided [52]. For Datasets 4 and 5, the ground-truth transformation of each image pair is calculated by a set of corresponding pixels which were manually selected.
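A minimal sketch of the ARE computation in (12), assuming both the estimated and ground-truth transformations are given as 2 × 3 affine matrices acting on homogeneous pixel coordinates (an assumption made for this illustration):

import numpy as np

def average_registration_error(T_e, T_g, height, width):
    """Average registration error, as defined in (12).

    T_e, T_g: estimated and ground-truth 2x3 affine transformation matrices
    height, width: size of the reference image
    """
    xs, ys = np.meshgrid(np.arange(1, width + 1), np.arange(1, height + 1))
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])  # 3 x (H*W)
    diff = T_e @ pts - T_g @ pts        # 2 x (H*W) displacement differences
    return float(np.mean(np.linalg.norm(diff, axis=0)))

# Toy example: a 2-pixel translation error gives ARE = 2 for any image size.
T_g = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
T_e = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]])
print(average_registration_error(T_e, T_g, height=100, width=120))  # 2.0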

6.3 Accuracy of scale estimation

As discussed in Section 4, achieving scale invariance is of critical importance in image registration. In our proposed COREG, the reference and target images are resized using the estimated scale difference. If the estimated scale difference is close to the ground-truth scale difference, the reference and target images will have similar scales after being resized. Here, the accuracy of scale estimation is measured by the relative deviation of the estimated scale difference from the ground-truth one. Let \(\sigma_e\) and \(\sigma_g\) denote the estimated scale difference and the ground-truth scale difference respectively. The error of estimating a scale difference is defined as

$$ \varepsilon_{s}=\frac{|\sigma_{e}-\sigma_{g}|}{\sigma_{g}} \times 100\%. $$
(13)

Figure 10 compares the estimated and ground-truth scale differences for the 40 image pairs of Datasets 1 to 4 at all five scale differences. It can be seen in Fig. 10 that the estimated scale difference is in many cases close to the ground-truth scale difference. With the measure defined in (13), a threshold on \(\varepsilon_s\) is set to 5%. For these 40 pairs, at the five scale differences from 1X vs 1X to 1X vs 4X, \(\varepsilon_s\) is below 5% in 33, 36, 35, 34 and 36 pairs, respectively. Clearly, our way of estimating scale differences is accurate and robust even when the scale difference between two images is large.
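As a simple worked example of (13), the Fig. 6 pair (ground-truth scale difference 1:2.73, estimated 1:2.82) falls well within the 5% threshold:

sigma_g, sigma_e = 2.73, 2.82
eps_s = abs(sigma_e - sigma_g) / sigma_g * 100  # about 3.3%, below the 5% threshold
print(f"{eps_s:.1f}%")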

Fig. 10 Comparing estimated and ground-truth scale differences. From bottom to top, the five groups of lines are for 1X vs 1X, 1X vs 1.5X, 1X vs 2X, 1X vs 3X and 1X vs 4X respectively

6.4 Performance comparisons

6.4.1 Comparisons in ARE

Figures 11, 12, 13, 14 and 15 present the ARE results of registering image pairs from Datasets 1 to 4. All five patterns of scale differences, i.e. 1X vs 1X, 1X vs 1.5X, 1X vs 2X, 1X vs 3X and 1X vs 4X, have been tested. For each dataset, the IDs of the image pairs and their corresponding imaging conditions are listed in Table 4. Note that the brightfield and confocal microscopic images (Pairs 25 to 40) have been processed by DSS [31, 32] to increase the structural similarity.

Fig. 11 ARE comparisons between elastix, GI-PIIFD and COREG for image pairs of 1X vs 1X scale difference from Datasets 1 to 4

Fig. 12 ARE comparisons between elastix, GI-PIIFD and COREG for image pairs of 1X vs 1.5X scale difference from Datasets 1 to 4

Fig. 13 ARE comparisons between elastix, GI-PIIFD and COREG for image pairs of 1X vs 2X scale difference from Datasets 1 to 4

Fig. 14 ARE comparisons between elastix, GI-PIIFD and COREG for image pairs of 1X vs 3X scale difference from Datasets 1 to 4

Fig. 15 ARE comparisons between elastix, GI-PIIFD and COREG for image pairs of 1X vs 4X scale difference from Datasets 1 to 4

Table 4 Pair IDs and imaging condition

Figure 11 compares the ARE achieved by elastix [24], GI-PIIFD and COREG when registering image pairs with a 1X vs 1X scale difference from Datasets 1 to 4. Clearly, COREG far outperforms elastix and GI-PIIFD; the average ARE achieved by elastix, GI-PIIFD and COREG is 9.12, 6.92 and 2.27 respectively. Since our work is focused on multimodal image registration based on local features, we have only tested elastix at the 1X vs 1X scale difference. As shown in Fig. 11, the advantage of COREG over elastix is already very clear. Overall, elastix and GI-PIIFD perform very poorly in registering Pairs 24 to 40, whereas COREG performs much better. Pairs 25 to 40 are brightfield and confocal microscopic images, whose content differences remain large even after being processed by DSS [31, 32]. Regarding Pair 24, the objects are very unclear and the content differences are very large, as shown in Fig. 9g and h.

For the other four patterns of scale differences, GI-PIIFD and COREG are compared in Figs. 12, 13, 14 and 15. As the scale difference increases, GI-PIIFD performs increasingly poorly, whereas COREG is much more robust. In other words, the advantage of COREG over GI-PIIFD becomes more significant as the scale difference increases. Table 5 compares the average ARE values of GI-PIIFD and COREG for the five patterns of scale differences; the advantage of COREG over GI-PIIFD is very clear. Note that the special consideration described in Section 5.6 has been applied when registering image pair 11 across all five scale differences, as fewer than three corners were detected using the default settings of the corner detector [18]. More specifically, the high threshold for the Canny edge detector was lowered from 0.35 to 0.25 when registering this image pair.

Table 5 Average ARE of GI-PIIFD and COREG when registering image pairs of each scale difference from Datasets 1 to 4

Moreover, Fig. 16 compares GI-PIIFD and COREG in terms of ARE when registering brightfield and fluorescence microscopic images. Consistently, COREG achieves much smaller ARE than GI-PIIFD; the average ARE achieved by GI-PIIFD and COREG is 21.19 and 4.00 respectively. For Pairs 64 to 81, the scale difference between the two images is 1X vs 4X. When registering these images with GI-PIIFD, the ARE values are considerably larger than those for Pairs 1 to 63. By comparison, the ARE values achieved by COREG remain relatively stable; across all image pairs, the largest ARE value for COREG is 7.88. Hence, COREG is more robust than GI-PIIFD in dealing with large scale differences.

Fig. 16 ARE comparisons between GI-PIIFD and COREG when registering brightfield and fluorescence microscopic images. The scale difference is 1X vs 1X, 1X vs 2X and 1X vs 4X for Pairs 1 to 27, Pairs 28 to 63 and Pairs 64 to 81 respectively

Based on the above comparisons, we summarize the following. First, all three techniques, i.e. elastix, GI-PIIFD and COREG, achieve satisfactory registration performance when registering images without large content and scale differences, such as Pairs 1 to 23 in Fig. 11. Second, COREG generally outperforms elastix and GI-PIIFD when dealing with images in which there exist large content and/or scale differences.

6.4.2 Comparisons in registration accuracy

Figure 17 compares the registration accuracy of GI-PIIFD and COREG in registering a pair of brightfield and confocal images. Note that the two images have been processed by DSS [31, 32]. The alignments achieved by GI-PIIFD and COREG are compared using checkerboard images. To generate an aligned image, the estimated transformation is used to transform the brightfield image onto its corresponding confocal image; the transformed brightfield image and the confocal image are then displayed alternately in a checkerboard format. To better identify the alignment of image structures, the foregrounds of the brightfield and confocal images are displayed in red and green respectively in the checkerboard image. In the example shown in Fig. 17, the actual scale difference between the brightfield and confocal images is 1:3.76. The ARE values achieved by GI-PIIFD and COREG are 121.61 and 4.87 respectively. To compare the alignments achieved by GI-PIIFD and COREG more easily, a small area of corresponding parts is extracted from the brightfield and confocal images, as shown in Fig. 17e and f. Clearly, Fig. 17h shows a much better alignment than Fig. 17g. Thus, COREG significantly improves the registration performance over GI-PIIFD.

Fig. 17 An alignment example. (a) and (b) are a pair of brightfield and confocal images which have been processed by DSS [31, 32]. (c) and (d) are the aligned results when registering (a) and (b) using GI-PIIFD and COREG respectively; a checkerboard format is used for better illustration. (e) and (f) are corresponding parts manually extracted from (a) and (b) respectively. (g) and (h) are manually extracted from (c) and (d) to illustrate how (e) and (f) are aligned using GI-PIIFD and COREG respectively

6.4.3 Efficiency analysis

Although our focus is on improving the registration accuracy, we now give a rough efficiency comparison between GI-PIIFD and COREG as follows.

  i. In registering image pairs with the same or similar scales, GI-PIIFD is approximately 12% faster than COREG. When COREG was used, less than 45 minutes was spent registering a pair of our test multimodal images. Since our experiments were carried out in Matlab, the efficiency should improve significantly on other programming platforms such as C or C++.

     There are two main reasons why COREG is less efficient than GI-PIIFD. First, two rounds of matching corner triplets are needed in COREG, whereas there is only one round in GI-PIIFD. Second, compared with GI-PIIFD, additional time is needed in COREG for refining localizations, as discussed in Section 5.5. On the other hand, COREG is more efficient than GI-PIIFD in building local descriptors: the local descriptor in GI-PIIFD is 128-dimensional, whereas only the curvature and the 16-dimensional DEPAC descriptor are used to describe corners in COREG.

  ii. As the scale difference in an image pair increases, COREG achieves comparable or even higher efficiency than GI-PIIFD.

     When the scale difference increases, the space of candidate geometric transformations becomes larger, and accordingly more time is needed for comparing corner triplets. In COREG, the reference and target images have similar scales after applying the estimated scale difference, so the second round of comparing corner triplets in COREG is much faster than the first round.

6.5 A discussion on corner triplets

As introduced in Section 5.1, the proposed technique essentially compares pairs of corner triplets from the two images and estimates the optimal geometric transformation for the final alignment. All possible pairs of corner triplets are compared and ranked in terms of NOP values, and the triplet pair with the maximum NOP value yields the estimated transformation between the two images. Hence, how well two images are aligned depends on the correctness of the triplet pair with the maximum NOP value, rather than on the number of triplet pairs. Based on our analysis, a correct transformation can be estimated as long as there exists at least one corresponding triplet pair between the two images. If this condition is not met, we recommend adjusting the default settings of the corner detector [18], as discussed in Section 5.6. By doing so, a sufficient number of corresponding triplet pairs can be generated, allowing a correct transformation between the two images to be estimated.

7 Conclusion

We have presented a novel multimodal image registration technique based on corners. To address large content differences in multimodal images, we have explored curvatures of corners and proposed a novel corner descriptor for feature representation. The proposed feature representations are independent of intensity and gradient changes in multimodal images. Moreover, we have proposed a simple yet effective way of estimating the scale difference between the reference and target images; the scale estimation is achieved with the assistance of the pair of corner triplets that leads to the optimal transformation between the reference and target images. Our experimental results have shown that our proposed technique achieves much greater robustness to both content and scale differences than state-of-the-art multimodal image registration techniques.

COREG is suited to registering a wide range of multimodal images involving scale, rotation and translation transformations as well as blur and illumination changes. Moreover, our proposed DEPAC corner descriptor is applicable to various applications such as object recognition [5], image retrieval [6] and robot localization [11]. Our future work includes developing image representations which are robust to further transformations and effects, including occlusion and deformation.