1 Introduction

A wide-view, high-resolution image is required in many fields. At present, there are two main ways to acquire such images: the first is to fix the camera on its rotation axis and take photos while pivoting; the second is to fix the optical centre of the camera, move the lens horizontally and take photos [1]. To merge these images into a single wide-view, high-resolution image, it is necessary to study image stitching techniques.

Image stitching, also known as image mosaicing, is a process that combines images with overlapped areas to form an image with a wide view and high resolution [2]. Nowadays, image stitching plays a vital role in digital image processing, making it a popular topic in photographic cartography, computer vision, image processing and computer graphics. It is widely applied in remote sensing, aerospace, virtual reality, medical imaging and so on [3,4,5,6].

A well-stitched image should be clear, with smooth edges and high resolution. The main process of image stitching consists of feature matching, registration and seam removal. Taking two images as an example, the stitching process is shown in Fig. 1. Unfortunately, there is no universal method for stitching all kinds of images at present because of the complexity of images. However, the core problem is to find the overlapped area and determine the positional relationship between two images.

Fig. 1

Basic flow chart of image stitching

Usually, after inputting two images with overlapped areas, feature matching is applied to find the corresponding points of the images for stitching. Image registration is then performed to bring the images into the same coordinate system. Seam removal is employed to eliminate the seam, because the outside environment and camera settings (such as exposure) may differ when the images are acquired. Therefore, image registration and seam removal are the key techniques of image stitching.

With the development of new techniques, the demands on speed and precision keep being raised. In recent years, many achievements have been made in this field. For example, the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) methods improve the precision of image registration to a certain degree, and optimal seam methods make the transition between two images more natural.

In this article, algorithms for image stitching, covering image registration and seam removal, are presented in Sects. 2 and 3, respectively. Next, some quality assessment methods for stitched images are provided in Sect. 4. A discussion is given in Sect. 5. Finally, conclusions are drawn in Sect. 6.

2 The key to image stitching: image registration

Image registration is a process that aligns images of the same area acquired at different times, from different views or by different sensors [7]. It can be single-modality or multi-modality, rigid or non-rigid. In rigid registration, images are assumed to need only rotation and translation to be aligned properly. In non-rigid registration, two images cannot be aligned properly without localized stretching of the images. The methods commonly used in image stitching are single-modality, rigid ones, which are mainly categorized into region-based methods and feature-based methods. Region-based methods start from pixel values and mathematically calculate the intensity difference over the same area between the image to be registered and the reference image. In general, such methods have a large computational cost but low registration accuracy. Feature-based methods do not use pixel values directly; they extract features from the images and match the overlapped area using the extracted features. This kind of method is robust and precise.

2.1 Image registration based on regions

2.1.1 Pixel matching algorithm

Basic pixel matching operates on monochrome images. The algorithm selects one image as a base, and its overlapped area is used to determine the matched region in the other image. Some methods use one image as the reference and slide the overlapped area inside the reference image; each time it moves by one pixel, the relativity is calculated. Suppose I1 is the reference and I2 is the other image, and the overlapped areas are Ω1 and Ω2, both of size M × N. The relativity R is defined as

$$ R = \sum\limits_{x = 1}^{M} {\sum\limits_{y = 1}^{N} {\left| {\Omega_{1} (x,y) - \Omega_{2} (x,y)} \right|} } . $$
(1)

Then, the row (column) with the best relativity (the minimum R for the absolute-difference measure above) is selected for registration. Obviously, the computational cost is large: for an M × N sized overlapped area, the search map is on the order of M × N as well. It is also difficult for such methods to deal with rotation and scaling.
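As a concrete illustration, the following minimal sketch (Python with NumPy; the function name and the horizontal-only search are our own simplifying assumptions) slides a template strip from one image across the reference and evaluates Eq. (1) at every offset; the best alignment is the offset with the smallest absolute difference.

```python
import numpy as np

def sad_relativity(ref, tpl):
    """Slide template `tpl` horizontally over `ref` (both greyscale, same
    height) and return the relativity R of Eq. (1) for every offset."""
    h, w_tpl = tpl.shape
    w_ref = ref.shape[1]
    scores = np.empty(w_ref - w_tpl + 1)
    for dx in range(w_ref - w_tpl + 1):
        window = ref[:, dx:dx + w_tpl].astype(np.float64)
        scores[dx] = np.abs(window - tpl.astype(np.float64)).sum()  # Eq. (1)
    return scores

# Best horizontal alignment: the offset with the smallest absolute difference.
# best_dx = np.argmin(sad_relativity(reference_gray, template_gray))
```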

Figure 2 shows two images with an overlapped area, and Fig. 3 is the stitching result using such a method. Note that greyscale images are employed for the calculation, but the result is mapped back into the RGB domain. In this method, the relativity within a window is calculated to find the suitable position for aligning the two images. As the method cannot deal with rotation and scaling, the images are kept at their original size and orientation. The relativity of Fig. 2a, b is high at the place denoted by the arrow labelled with high relativity, so it is chosen as the base for aligning the images; the relativity at the place denoted by the arrow labelled with low relativity is low, so it is not chosen. From Fig. 3, we can see that there are obvious discontinuities in the overlapped area. This is because the lens was rotated to a different angle when photographing the right image (see Fig. 2b), while the matching method cannot deal with rotations. To register these images better, methods that can handle rotation should be applied.

Fig. 2

Images with overlapped area

Fig. 3

Image stitching using relativity measurement

To reduce the computational cost, the search field is restricted to a region and put into practice in [7, 8]. An image stitching method based on cross-correlation is also proposed in [9] to compute the displacements between the source images.

An image stitching method using dynamic time warping (DTW) is proposed in [10]. DTW aligns two time series by matching their elements while minimizing a total distance measure via dynamic programming. The method starts by selecting two images with an overlapped area. Then, an m × n sub-image is cropped from one of them as a template, where features are extracted and normalized into a 1-dimensional sequence. Features of the other image are also extracted and compared using DTW to identify the matching position. Finally, the stitched image is created.

In [11], two rows with a certain interval in the overlapped area of an image I1 are used as the template to search for the best matching points in the corresponding overlapped area of I2. The number of matching rows is then increased to three to improve the search strategy and make the result more precise. In [12], a technique for stitching biomedical images is presented. This algorithm operates by equilibrating neighbouring edges and forcing the brightness at corners to a common value. In [13], brightness correction is performed with a specific matrix before registration to make the algorithm less sensitive to illumination changes.

In general, this kind of method is simple to implement, but its computational cost is large. Because images are often taken with deformations while no geometric transformation is applied during registration, the result is often not as good as expected.

2.1.2 Image registration based on mutual information

Mutual information (MI) is a measure of the dependence between two variables [14]. If two variables are independent, their mutual information is 0; if they are strongly dependent, their mutual information is large. In recent years, MI has been widely applied in image registration [15,16,17], image retrieval [18], pattern recognition [19] and image stitching [20, 21].

The MI of images I1 and I2 is defined as

$$ {\text{MI}}(I_{1} ,I_{2} ) = \sum\limits_{{I_{1} ,I_{2} }} {P_{{I_{1} I_{2} }} (I_{1} ,I_{2} )\log_{10} \frac{{P_{{I_{1} I_{2} }} (I_{1} ,I_{2} )}}{{P_{{I_{1} }} (I_{1} )P_{{I_{2} }} (I_{2} )}}} , $$
(2)

where \(P_{{I_{1} I_{2} }} (I_{1} ,I_{2} )\) is the joint probability density, \(P_{{I_{1} }} (I_{1} )\) and \(P_{{I_{2} }} (I_{2} )\) are marginal probability densities.

According to the definition of MI, the two images are treated as random variables; they share large MI if the similarity between them is large. Thus, the image registration problem becomes the problem of searching for the maximum MI between the template and the sub-image.
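As an illustration, the following minimal sketch (Python/NumPy; the function name, the bin count and the use of the natural logarithm are our own choices) estimates Eq. (2) from the joint intensity histogram of two equally sized patches.

```python
import numpy as np

def mutual_information(img1, img2, bins=64):
    """Estimate the MI between two equally sized greyscale patches
    from their joint intensity histogram (Eq. (2))."""
    joint, _, _ = np.histogram2d(img1.ravel(), img2.ravel(), bins=bins)
    p_xy = joint / joint.sum()          # joint probability
    p_x = p_xy.sum(axis=1)              # marginal of img1
    p_y = p_xy.sum(axis=0)              # marginal of img2
    nz = p_xy > 0                       # avoid log(0)
    return np.sum(p_xy[nz] * np.log(p_xy[nz] / np.outer(p_x, p_y)[nz]))

# Registration then amounts to searching for the template position that
# maximizes mutual_information(template, candidate_window).
```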

To improve the MI algorithm, concepts such as NMI (normalized mutual information) [22] and LMI (local mutual information) [23] have been proposed. A self-similarity-weighted, graph-based implementation of α-mutual information (α-MI) for non-rigid image registration is proposed in [16] and applied to registering pre-operative magnetic resonance (MR) images to intra-operative ultrasound (US) images, where α-MI is used to measure the similarity between images.

Registration using MI offers high precision, low sensitivity to interference and strong reliability. However, the computation is complex, and the robustness to rotation is weak.

2.1.3 Image registration based on Laplacian pyramid

Image registration based on the Laplacian pyramid was first proposed in [24], where a sequence of low-pass filtered images I1, …, IL can be obtained by repeatedly convolving a small weighting function with a sample image I0. In this method, the image sample density is also decreased at each iteration, so that the bandwidth is reduced in uniform one-octave steps. Sample reduction also means that the computational cost is kept to a minimum. Each pixel value of I1 corresponds to a 5 × 5 weighted mean of pixels in I0. Iterating in this way yields a series of low-pass filtered images (the Gaussian low-pass pyramid) I0, …, IL. The pixel of the lth level image Il(x, y) is defined as

$$ I_{l} (x,y) = \sum\limits_{i = 1}^{5} {\sum\limits_{j = 1}^{5} {w(i,j)I_{l - 1} (2x + i,\;2y + j)} } . $$
(3)

w(i, j) is called the generating kernel, which is constrained by

$$ \sum\limits_{{i = { - }2}}^{2} {\sum\limits_{j = - 2}^{2} {w_{l} (x - 2^{l} i,y - 2^{l} j)} } = 1 $$
(4)
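A minimal sketch of this pyramid construction (Python with NumPy/SciPy; the function name and the kernel parameter a = 0.375 of the classic 5-tap generating kernel are our own assumptions) is given below.

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_pyramid(img, levels=4, a=0.375):
    """Build the Gaussian (low-pass) pyramid I0..I_levels using a separable
    5-tap generating kernel and 2x subsampling, as in Eq. (3)."""
    w1d = np.array([0.25 - a / 2, 0.25, a, 0.25, 0.25 - a / 2])
    w = np.outer(w1d, w1d)              # 5x5 kernel, sums to 1 (Eq. (4))
    pyr = [img.astype(np.float64)]
    for _ in range(levels):
        blurred = convolve(pyr[-1], w, mode='reflect')
        pyr.append(blurred[::2, ::2])   # halve the sample density
    return pyr

# Laplacian (band-pass) levels are then differences between each level and
# an upsampled version of the next coarser level.
```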

A localized image stitching algorithm based on the classic Laplacian method is proposed in [25]. A multi-resolution template, which can compute multiple templates in one scanning pass and obtain descriptors at different scales, is introduced. Experiments show that the use of structure information constrains the decomposition and reconstruction process and reduces the computational cost.

2.1.4 Registration in the transformation domain

In the transformation domain, the methods commonly used to register images are the Fourier transformation (FT) and the wavelet transformation (WT). FT-based registration is introduced first, followed by WT-based registration.

In 1970, FT was used in image registration [26]. In [27], FT is applied into underwater sonar image stitching. In [28], it is used for stitching computed tomography (CT) image sets.
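The cited works do not all share one algorithm, but a standard FT-based registration technique is phase correlation, sketched below (Python/NumPy; the function name and the pure-translation assumption are ours, not a description of the methods in [26,27,28]).

```python
import numpy as np

def phase_correlation(img1, img2):
    """Estimate the (dy, dx) translation between two equally sized
    greyscale images with the classic FT-based phase correlation."""
    F1 = np.fft.fft2(img1)
    F2 = np.fft.fft2(img2)
    cross_power = F1 * np.conj(F2)
    cross_power /= np.abs(cross_power) + 1e-12       # keep phase only
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts larger than half the image size to negative values.
    if dy > img1.shape[0] // 2:
        dy -= img1.shape[0]
    if dx > img1.shape[1] // 2:
        dx -= img1.shape[1]
    return dy, dx
```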

An image registration method based on WT is proposed in [29]. Dual-tree complex wavelet transformation (DT-CWT) is applied to decompose the images to be registered. Starting at the coarsest level, a first estimate of the transformation vector v = [α, tx, ty] is found, where α is the rotation angle, tx and ty are the translation parameters in the x and y directions, respectively. Note that in this work, scaling transformation is omitted. Cross correlation is chosen as a matching criterion. Next, edge maps of both low-passed images, Emap1 and Emap2, are extracted. Searching is performed to determine the best initial transformation vector vinit:

$$ v_{{{\text{init}}}} = \mathop {\arg \max }\limits_{v} c(Emap_{1} ,T(Emap_{2} )), $$
(5)

where c(x, y) denotes the cross correlation between x and y, and T(Emap2) denotes the image warped using vector v.

The transformation vector vl is found by

$$ v_{l} = \mathop {\arg \max }\limits_{v} \left[ {\begin{array}{*{20}c} {I({\text{Re}} \{ D_{1} ,l\} ,T({\text{Re}} \{ D_{2} ,l\} ))} \\ {I\left( {\left\| {Cp\{ D_{1} ,l\} } \right\|,T\left\| {Cp\{ D_{2} ,l\} } \right\|} \right)} \\ \end{array} } \right]. $$
(6)

Re{x, l} and Cp{x, l} are the real and complex parts of x at level l, respectively. D1 denotes the decomposition of the original image and D2 the decomposition of the image to be registered. To further reduce the computational burden, the alternative search method proposed in [29] can be applied.

Registration in the transformation domain is fast, robust and can handle most deformations in images. However, when the images have little overlap, the result may not be precise enough.

2.2 Image registration based on features

2.2.1 Corner detecting algorithm

A corner, usually generated at the intersection of lines, is a point where there is a sharp change in brightness or large curvature [30, 31]. It is scale-invariant and little affected by illumination. Therefore, corners are suitable features for detection. The operators generally used are the Forstner, Moravec and Harris operators.

The Forstner operator was first proposed by Wolfgang Forstner in [32], where q and w are calculated by

$$ q = \frac{4\det (C)}{{({\text{trace}}(C))^{2} }} $$
(7)
$$ w = \frac{\det (C)}{{{\text{trace}}(C)}} $$
(8)

where C is the intensity covariance matrix. Suppose t is a threshold (a fixed value). If q > t, the pixel is a candidate. Then, local extrema of q are selected as features, and the best candidate window is the window centred at such a point. Normal equations are used to calculate the accurate position of the corner, where gx and gy are the Roberts gradients in the x and y directions, respectively:

$$ C\left[ \begin{gathered} x \hfill \\ y \hfill \\ \end{gathered} \right] = \left[ \begin{gathered} x\sum {g_{x}^{2} + } y\sum {g_{x} g_{y} } \hfill \\ x\sum {g_{x} g_{y} + y\sum {g_{y}^{2} } } \hfill \\ \end{gathered} \right] $$
(9)

In 1980, the Moravec operator was formally proposed in [33]. It computes the intensity variance in four main directions and takes the minimum as the corner response; points where this response is locally maximal are selected as corners.

The Harris operator was proposed in [34] based on the Moravec method [35]. In the Harris operator, the detector Dh is used to detect corners:

$$ D_{{\text{h}}} (i,j) = \sum\limits_{x} {\sum\limits_{y} {w(x,y)[I(x + i,y + j) - I(x,y)]^{2} } } , $$
(10)

where I(x, y) is a pixel in image I and w(x, y) is the window function, which can be a rectangular window or a Gaussian window. Suppose

$$ M = \sum\limits_{x} {\sum\limits_{y} {w(x,y)} \left[ {\begin{array}{*{20}c} {I_{x}^{2} } & {I_{x} I_{y} } \\ {I_{x} I_{y} } & {I_{y}^{2} } \\ \end{array} } \right]} $$
(11)

where Ix and Iy are the gradient values in the x and y directions. The detection operator can then be approximated as

$$ D_{{\text{h}}} (i,j) = \left[ {\begin{array}{*{20}c} i & j \\ \end{array} } \right]M\left[ \begin{gathered} i \hfill \\ j \hfill \\ \end{gathered} \right]. $$
(12)

The response value Re of the detection is

$$ R_{{\text{e}}} = \det (M){ - }k[{\text{trace}}(M)]^{2} . $$
(13)

Usually, the value of k is 0.04–0.06 [36]. Response values Re smaller than a threshold t are set to zero, and non-maximum suppression is performed in a 3 × 3 or 5 × 5 neighbourhood; the remaining local maxima are taken as corners.
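A minimal sketch of this pipeline (Python with NumPy/SciPy; the Sobel gradients, the Gaussian window width and the relative threshold are our own choices) is shown below.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, sobel

def harris_corners(gray, sigma=1.5, k=0.04, rel_thresh=0.01):
    """Harris response Re = det(M) - k*trace(M)^2 (Eq. (13)) with a Gaussian
    window, then thresholding and 3x3 non-maximum suppression."""
    Ix = sobel(gray.astype(np.float64), axis=1)
    Iy = sobel(gray.astype(np.float64), axis=0)
    # Elements of M (Eq. (11)) smoothed by the window function w(x, y).
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    resp = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2   # Eq. (13)
    resp[resp < rel_thresh * resp.max()] = 0               # threshold t
    local_max = resp == maximum_filter(resp, size=3)       # 3x3 NMS
    return np.argwhere(local_max & (resp > 0))             # corner coordinates
```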

The Harris operator is invariant to translation and rotation. However, the value of t depends on image attributes, so there is no general guideline for setting a specific threshold. In addition, points with large eigenvalues are often clustered in certain regions, which makes the distribution of detected corners non-uniform: more corners are detected in regions with rich texture and fewer in regions with little texture, so the corners tend to concentrate where the texture information is richer [30, 37].

To improve Harris corner detection, a new response function is given in [38]. Re is redefined as:

$$ R_{{\text{e}}} = \frac{\det (M)}{{({\text{trace}}(M))^{2} + \varepsilon }}, $$
(14)

where ε is used to prevent the case when trace(M) is 0. A very small ε, such as 10−6, can be used. In addition, with the redefined Re, the selection of k is no longer needed.

There are some applications of corner detecting algorithms in image stitching. In [39], an adaptive Harris corner detection for image stitching is given. In [40], Harris corner detector is customized to meet the requirements of viewing a panorama video in real time. In [41], corner detection is adopted to help extract features. In [42], an improved Harris corner detection algorithm is proposed to stitch the images related to traffic accidents.

2.2.2 Scale-invariant feature transform (SIFT) method

The SIFT method was proposed by David G. Lowe in 1999 and was first used for feature detection [43, 44]. In 2004, the algorithm was upgraded [45]. SIFT has been used in image registration [46,47,48,49] and image stitching [50,51,52,53,54]. The SIFT algorithm detects extrema in scale space and selects key points using the Gaussian kernel G. The difference of Gaussians (DoG) at different scales σ is used to detect the extrema:

$$ G(x,y,\sigma ) = \frac{1}{{2\pi \sigma^{2} }}e^{{ - \frac{{x^{2} + y^{2} }}{{2\sigma^{2} }}}} $$
(15)
$$ \begin{aligned} DoG(x,y,\sigma ) & = [G(x,y,k\sigma ) - G(x,y,\sigma )]*I(x,y) \\ & = L(x,y,k\sigma ) - L(x,y,\sigma ). \\ \end{aligned} $$
(16)

For every L(x, y, σ), the gradient magnitude and orientation around each key point are then computed. A 16 × 16 window centred at the key point is taken and divided into 4 × 4 sub-regions, and an 8-direction gradient histogram is computed for each sub-region. The key point descriptor is therefore a 128-dimensional vector.

Key point descriptors are matched using measures such as the Euclidean distance. After matching, random sample consensus (RANSAC) is usually applied to remove mismatches [55].

Then, a geometric transformation is estimated so that one image can be warped onto the other. Usually, a homography matrix is adopted in this step to describe the class of 2-dimensional planar projective transformations by matrix multiplication. The projective transformation can be defined by a matrix H with 8 degrees of freedom:

$$ \left[ \begin{gathered} {x^{\prime}} \hfill \\ {y^{\prime}} \hfill \\ 1 \hfill \\ \end{gathered} \right] = H\left[ \begin{gathered} x \hfill \\ y \hfill \\ 1 \hfill \\ \end{gathered} \right] = \left[ {\begin{array}{*{20}c} {a_{1} } & {a_{2} } & {a_{3} } \\ {a_{4} } & {a_{5} } & {a_{6} } \\ {a_{7} } & {a_{8} } & 1 \\ \end{array} } \right]\left[ \begin{gathered} x \hfill \\ y \hfill \\ 1 \hfill \\ \end{gathered} \right], $$
(17)

where x′ and y′ can be obtained as

$$ \left\{ \begin{gathered} x^{\prime} = \frac{{a_{1} x + a_{2} y + a_{3} }}{{a_{7} x + a_{8} y + 1}} \hfill \\ y^{\prime} = \frac{{a_{4} x + a_{5} y + a_{6} }}{{a_{7} x + a_{8} y + 1}} \hfill \\ \end{gathered} \right.. $$
(18)

By estimating appropriate parameters and warping, the two images can be combined with their main features aligned. The homography matrix is determined by eight parameters, so corresponding point coordinates must be selected to estimate them. In pixel-based algorithms, however, registration only selects a template from one image and measures the relativity in the other, so no such transformation can be applied. This is the reason why Fig. 3 contains large mismatches.
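The following minimal sketch shows this pipeline with OpenCV (Python; the ratio threshold, the RANSAC reprojection error and the naive compositing are our own assumptions, not the exact settings used for the figures).

```python
import cv2
import numpy as np

def stitch_with_sift(img1, img2, ratio=0.6):
    """Match SIFT key points with a ratio test, estimate the homography H
    of Eq. (17) with RANSAC, and warp img2 onto img1."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < ratio * n.distance]            # ratio test
    src = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # Eq. (17)
    h, w = img1.shape[:2]
    canvas = cv2.warpPerspective(img2, H, (w * 2, h))       # warp img2 to img1
    canvas[0:h, 0:w] = img1                                 # naive composite
    return canvas, H, len(good), int(inliers.sum())
```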

Image stitching with SIFT performs well in many cases. It runs fast and can deal with translation, rotation and scaling. Figure 4 shows the matching result with SIFT; the horizontal lines connect corresponding points in the two images. Figure 5 is the stitched result of Fig. 2 with SIFT.

Fig. 4

Matched result with SIFT

Fig. 5

Stitched image with SIFT

SIFT is improved in [56] as PCA-SIFT, which accepts the same input as the standard SIFT descriptor. PCA [57] is short for principal component analysis, a technique for dimensionality reduction that has been widely applied in computer vision [58,59,60]. There are three main steps in PCA-SIFT: first, an eigenspace is computed to express the gradient images of local patches; second, the local image gradient of a given patch is calculated; third, the gradient image vector is projected onto the eigenspace to derive a compact feature vector.

The SIFT descriptor is invariant to image deformations such as translation, rotation and scaling. It is also robust to moderate perspective transformations and illumination variations. However, it cannot meet the demand for computation speed in some cases. As an improvement, PCA-SIFT consumes less time than SIFT at the cost of lower robustness to scaling.

2.2.3 Speed up robust feature (SURF) algorithm

SURF, a robust image recognition algorithm, was first proposed in 2006 [61]. It can be used in pattern recognition and 3D reconstruction. Experiments show that, compared with SIFT, SURF is several times faster [62]. In recent years, studies on image registration [63, 64] and image stitching [65,66,67] based on SURF have demonstrated good results.

SURF makes use of the integral image so that convolution with the box-filter approximations of the second-order Gaussian derivative templates can be computed at very low cost.

Hessian matrix, H(x, σ) in x at scale σ, is then applied to detect feature points:

$$ H(x,\sigma ) = \left[ {\begin{array}{*{20}c} {L_{xx} (x,\sigma )} &\quad {L_{xy} (x,\sigma )} \\ {L_{xy} (x,\sigma )} & \quad{L_{yy} (x,\sigma )} \\ \end{array} } \right] $$
(19)

Lxx, Lxy and Lyy are the second-order derivatives in the xx, xy and yy directions, respectively, which can be obtained by convolving the image with second-order Gaussian derivative templates. In SURF, these templates are approximated by box functions, usually 9 × 9 box filters corresponding to a Gaussian filter with σ = 1.2. The resulting approximations are called blob response maps, denoted Bxx, Byy and Bxy.

Thus, det(H) can be approximated by

$$ \det (H) = B_{xx} B_{yy} - w(B_{xy} )^{2} . $$
(20)

Next, feature descriptors are created using Haar wavelets, with the integral image used to simplify the calculation. Each descriptor is a 64-dimensional vector, stored together with the key point coordinate, scale and orientation. Figure 6 shows the matching points found with SURF, where the horizontal lines connect corresponding points in the two images. Figure 7 is the stitching result of Fig. 2 using SURF.

Fig. 6

Matched result with SURF

Fig. 7

Stitched image with SURF
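A minimal sketch of SURF detection with OpenCV is given below (Python; SURF lives in the contrib module cv2.xfeatures2d and is only available in builds with the non-free modules enabled, so its availability is an assumption here).

```python
import cv2

def surf_keypoints(gray, hessian_threshold=400):
    """Detect SURF key points and 64-dimensional descriptors.
    Requires an opencv-contrib build with the non-free modules enabled."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
    surf.setExtended(False)   # 64-D descriptors (True would give 128-D)
    surf.setUpright(False)    # estimate orientation from Haar responses
    keypoints, descriptors = surf.detectAndCompute(gray, None)
    return keypoints, descriptors
```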

Comparing Fig. 6 with Fig. 4, we can see that SURF finds fewer feature points than SIFT. The use of the Hessian matrix and box functions makes SURF faster than SIFT. However, compared with SIFT, SURF is less robust to large rotations.

Template-convolution speed-up robust features (TSURF), an improvement of SURF, is proposed in [68] to improve the stitching speed while maintaining precision. The maximum value (Hmax) of the Hessian matrix's determinant minus the second-largest value (Hsec) is used:

$$ H_{\max } - H_{\sec } \ge \theta . $$
(21)

In their algorithm, θ is set to 0.5 when the image contrast is 0.14. If Hmax − Hsec ≥ θ, the point is taken as a candidate. The higher the contrast, the greater θ should be.

Registration based on SURF consumes less time than registration based on SIFT. However, it is less robust than SIFT to rotations in images. In addition, SURF produces more misalignments than SIFT.

2.2.4 KAZE features

KAZE feature is a multi-scale 2-dimensional feature detection and description algorithm in nonlinear scale spaces proposed in 2012 [69], where 2-dimensional features are detected and described in a nonlinear scale space by nonlinear diffusion filtering. In KAZE feature description, the nonlinear scale space up to a maximum evolution time is built at first. Then, the 2-dimensional features, which exhibit the max value of the scale-normalized determinant of the Hessian response through the nonlinear scale space, are detected. Finally, the main orientation of the key point is calculated to obtain a scale and rotation invariant descriptor.

In the KAZE feature description, the scale space is discretized logarithmically into a series of O octaves and S sub-levels. The octaves and sub-levels are identified by a discrete octave index o and a sub-level index s. They are mapped to their corresponding scale ε as

$$ \varepsilon_{i} (o,s) = \varepsilon_{0} 2^{{o + \frac{s}{S}}} ,\quad o \in [0,O - 1],\;s \in [0,S - 1],\;i \in [0,N], $$
(22)

where ε0 is the base scale level, N is the total number of filtered images.

Then, the scale levels in pixel units are converted into time units with

$$ t_{i} = \frac{{\varepsilon_{i}^{2} }}{2},\quad i \in [0,N]. $$
(23)

An input image is first convolved with a Gaussian kernel of standard deviation ε0 to reduce noise and possible image artefacts. Then, its gradient histogram is computed to obtain the contrast parameter k in an automatic procedure.

Given the contrast parameter and the set of evolution times ti, the non-linear space is built as

$$ L^{i + 1} = \left( {I - (t_{i + 1} - t_{i} )\sum\limits_{l = 1}^{m} {A_{l} L_{i} } } \right)^{ - 1} L_{i} . $$
(24)

To detect the interest points, the response of scale-normalized determinant of the Hessian matrix is computed at multiple scale levels:

$$ L_{{{\text{Hessian}}}} = \varepsilon^{2} (L_{xx} L_{yy} - L_{xy}^{2} ), $$
(25)

where Lxx and Lyy are the second-order horizontal and vertical derivatives, respectively, and Lxy is the second-order cross derivative.

Then, the maxima in scale and spatial location are found. The extrema are searched in all the filtered images (except the first and the last levels). Each extremum is searched over a rectangular window of size εi × εi on the current (i), upper (i + 1) and lower (i − 1) filtered images.

Finally, the descriptors are generated. The dominant orientation is found in a circular area of radius 6εi with a sampling step of size εi. The first-order derivatives Lx and Ly are weighted with a Gaussian centred at the interest point for each of the samples in the circular area. Then, the derivative responses are represented as points in a vector space, and the responses within a sliding circle segment covering an angle of π/3 are summed; the dominant orientation is obtained from the longest resulting vector.

For a feature detected at scale εi, first-order derivatives Lx and Ly of size εi are computed over a 24εi × 24εi rectangular grid divided into 4 × 4 sub-regions of size 9εi × 9εi with an overlap of 2εi. The derivative responses in each sub-region are weighted with a Gaussian (ε1 = 2.5εi) centred on the sub-region centre and summed into a descriptor vector dv = (ΣLx; ΣLy; Σ|Lx|; Σ|Ly|). Then, each sub-region vector is weighted using a Gaussian (ε2 = 1.5εi) defined over a 4 × 4 mask and centred on the interest key point. When the dominant orientation of the key point is considered, each of the samples in the rectangular grid is rotated according to the dominant orientation, and the derivatives are also computed according to it. Finally, the 64-dimensional descriptor vector is normalized into a unit vector to achieve contrast invariance.

Experiments show that the KAZE feature description is robust to noise and blurring. However, building the nonlinear scale space is computationally expensive, so it costs more time than SURF. To reduce the computational cost, A-KAZE, short for accelerated KAZE, is proposed [70]. It uses fast explicit diffusion (FED) in a pyramidal framework together with binary descriptors [71] to speed up the computation and is suitable for portable devices.
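Both variants are available in OpenCV; the minimal sketch below (Python; function and variable names are our own) detects KAZE and A-KAZE features and notes which distance metric their descriptors should be matched with.

```python
import cv2

def kaze_and_akaze(gray):
    """Detect KAZE (floating-point) and A-KAZE (binary MLDB) features
    with OpenCV's built-in implementations."""
    kaze = cv2.KAZE_create()     # nonlinear scale space, float descriptor
    akaze = cv2.AKAZE_create()   # FED pyramid + binary descriptor
    kp_k, des_k = kaze.detectAndCompute(gray, None)
    kp_a, des_a = akaze.detectAndCompute(gray, None)
    return (kp_k, des_k), (kp_a, des_a)

# Binary A-KAZE descriptors are matched with the Hamming distance, e.g.
# cv2.BFMatcher(cv2.NORM_HAMMING), whereas KAZE descriptors use NORM_L2.
```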

Since KAZE features were proposed, much research has focused on their application. In [72], KAZE features are used to match the feature points of frame patches in videos. In [73], S-AKAZE is proposed as a modification of KAZE for image matching. In [74], A-KAZE is applied to object recognition. In [75], KAZE is used to assist image segmentation. In [76], A-KAZE is adopted for image stitching.

2.3 Assessment to image registration descriptors

In this section, four main image registration descriptors (Harris, SIFT, SURF, KAZE) are used to register the two images shown in Fig. 2. The matching metric is the Euclidean distance. For each key point in one image, the two nearest points in the other image are found; if the ratio between the distance to the nearest point and the distance to the second nearest point is less than a threshold, the pair is considered a candidate match. The ratio threshold is set to 0.6. Mismatched points are then removed by the same method (RANSAC) to obtain the correct matching points. The correct rate is the percentage of correct matching points among all matching points found. The results are shown in Table 1.

Table 1 Assessment to image registration methods

From the table, we can see that KAZE finds the largest number of matching points and correct matching points among the four descriptors, while SURF finds the smallest. Among the four methods, SIFT has the highest correct rate among the matching points found, and is thus considered the best-performing descriptor.

3 The key to image stitching: seam removal

As image registration is the most important issue for successful stitching, much of the literature pays more attention to image registration than to seam removal. However, seam removal is also a critical issue in image stitching, since humans are sensitive to seams in an image, as shown in Fig. 8.

Fig. 8

Stitched image without seam removal

Seam removal can improve the general quality of the stitched image and make the information in the stitched image more accurate. Therefore, it is necessary to remove the obvious seam caused by brightness inconsistency between the input images after registration is finished. The seam usually appears at the border between the two images, producing a larger pixel variance or gradient difference. Essentially, seam removal needs to reduce such variance (difference) to an acceptable range, making the transition in the overlapped area smoother.

This process is similar to image fusion, which fuses images from different sensors into an integrated one using mathematical methods to meet certain demands [77]. The difference from image fusion is that seam removal only adjusts brightness and contrast; no extra information is added. By method, there are approaches based on pixel weighting, optimal seams and the transformation domain. Seam removal based on pixel weighting operates directly on the overlapped area but cannot achieve ideal results when there are large differences in the outside environment or camera settings between the two images.

Optimal seam methods instead find an optimal seam by optimizing an objective function that minimizes the difference near the overlapped area. Seam removal in the transformation domain first decomposes the image via a specific transformation and applies different rules to different bands; the image after seam removal is then reconstructed and output.

3.1 Seam removal based on pixel weighting

In many references, a weighting-based seam removal technique [78, 79] is employed:

$$ I(x,y) = a\Omega_{1} (x,y) + (1 - a)\Omega_{2} (x,y) $$
(26)

where a is a coefficient between 0 and 1, and Ω1(x, y), Ω2(x, y) are the pixel values of I1(x, y) and I2(x, y) in the overlapped area Ω, respectively. Initially, a was a fixed value, but such a setting cannot emphasize the regions of interest in an image [80].

Thus, a is made adaptive. When a varies gradually from 1 to 0 across the overlapped area, the two images transition smoothly. The selection of a is described in detail in [81]. Figure 9 shows the seam removal result with pixel weighting. From Fig. 9, we can see that this kind of weighting function is rather rough, so the seam is still visible. In some cases, such as the two images shown in Fig. 10 (stitched without seam removal in Fig. 11), it also causes ghost artefacts and cannot meet high precision demands (Fig. 12).
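A minimal sketch of this weighting scheme (Python/NumPy; the linear, column-wise ramp is one common choice of the adaptive weight, not necessarily the one used in [81]) is given below.

```python
import numpy as np

def blend_overlap(overlap1, overlap2):
    """Blend two aligned overlap regions with Eq. (26), letting the
    weight a fall linearly from 1 to 0 across the overlap width."""
    h, w = overlap1.shape[:2]
    a = np.linspace(1.0, 0.0, w)           # adaptive weight per column
    if overlap1.ndim == 3:                 # broadcast over colour channels
        a = a[None, :, None]
    else:
        a = a[None, :]
    blended = a * overlap1.astype(np.float64) + (1 - a) * overlap2.astype(np.float64)
    return np.clip(blended, 0, 255).astype(np.uint8)
```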

Fig. 9

Stitched image with pixel weighting

Fig. 10

Another set of images with overlapped area

Fig. 11

Stitched image without seam removal

Fig. 12

Stitched image with pixel weighting

3.2 Seam removal based on optimal seam methods

The optimal seam method searches for an optimal seam in the overlapped area to create a mapping, or labelling, between pixels in the composite and the source images. The optimal seam is found by optimizing an objective function that minimizes the difference near the overlapped area. To eliminate the seam more precisely, several optimal seam methods have been proposed [82,83,84,85,86,87,88].

An optimal seam finding method is proposed in [82]. Suppose I1 and I2 are two images with overlapped areas; the optimal seam is a connected path traversing the overlap that starts on the first row/column and ends on the last row/column. By adopting a compact and descriptive data structure called a tensor (a symmetric, positive semi-definite matrix T), the weights of pixels in the overlapped area are optimized. The Riemannian norm is used to measure the distance between two tensors.

The seam cost function Cost in an image whose height is M is defined as

$$ {\text{Cost}} = {\text{Cost}}(I_{{\text{s}}} ) = \sum\limits_{i = 1}^{M} {c(I(s_{i} ))} , $$
(27)

where I(si) is the ith pixel on seam s. c(i, j) is defined as

$$ c(i,j) = d(T_{1} (i,j),T_{2} (i,j)) + \left| {I_{1}^{\prime \prime } (i,j) + I_{2}^{\prime \prime } (i,j)} \right|, $$
(28)

where T1 and T2 are the tensors of I1 and I2, and I″ is the second derivative of image I. The optimal seam is defined as

$$ s^{*} = \mathop {\min }\limits_{s} {\text{Cost}}(s) = \mathop {\min }\limits_{s} \sum\limits_{i = 1}^{M} {c(I(s_{i} ))} . $$
(29)
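The minimization of Eq. (29) can be carried out with dynamic programming, much like seam carving. The sketch below (Python/NumPy) finds a vertical seam for an arbitrary per-pixel cost map; using the squared colour difference instead of the tensor distance of Eq. (28) is our own simplification.

```python
import numpy as np

def optimal_vertical_seam(cost):
    """Find the vertical seam (one column index per row) minimizing the
    accumulated cost of Eq. (29) by dynamic programming."""
    M, N = cost.shape
    acc = cost.astype(np.float64).copy()
    back = np.zeros((M, N), dtype=np.int64)
    for i in range(1, M):
        for j in range(N):
            lo, hi = max(j - 1, 0), min(j + 2, N)       # 3 parents above
            k = lo + int(np.argmin(acc[i - 1, lo:hi]))
            back[i, j] = k
            acc[i, j] += acc[i - 1, k]
    seam = np.empty(M, dtype=np.int64)
    seam[-1] = int(np.argmin(acc[-1]))
    for i in range(M - 1, 0, -1):                        # backtrack
        seam[i - 1] = back[i, seam[i]]
    return seam

# A simple stand-in for Eq. (28): per-pixel squared colour difference
# cost = ((overlap1.astype(float) - overlap2.astype(float)) ** 2).sum(axis=2)
```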

The demos of this method are shown in Figs. 13 and 14.

Fig. 13

Seam removal using optimal seam method only

Fig. 14

Seam removal using optimal seam method only

Some other optimal seam methods also show good results. In [83], an optimal seam method is used to handle the ghost artefacts caused by moving objects. In [84], an improved seam finding method is proposed by downscaling the overlapped areas to approximate the seam. In [86], a new energy aggregation and traversal strategy is adopted to find an optimal seam for image stitching. In [86], optimal seam lines between adjacent input images are detected via the graph-cut energy minimization framework. In [87], image colour, gradient magnitude and texture information are fused into graph cuts. In [88], an optimal seam pair is selected by comparing the cross correlations of multiple seams detected with variable cost weights.

3.3 Seam removal in the transformation domain

Seam removal in the transformation domain decomposes the image via specific transformation and applies different rules to different frequency bands. By doing seam removal in the transformation domain, the seam in the stitched image can be reduced. The common transformations are Fourier transformation, wavelet transformation and Curvelet transformation.

To remove the seam between two images, the common practice is first to decompose the images in the transformation domain into N levels, obtaining coefficients at different frequencies in the corresponding levels. Then, different rules are applied to the high-frequency and low-frequency bands to preserve both the contour and the detail information, since the high-frequency bands carry the detail while the low-frequency bands carry the contour. Finally, the image is reconstructed by the inverse transformation.

A method for seam removal using the Fourier transformation is given in [89]. Besides the Fourier transformation, the wavelet transformation can also be applied to seam removal [90, 91]. A wavelet-based seam removal method is provided in [90], where wavelet fusion is used to create seamless underwater stitching. In [91], the wavelet transformation is used to decompose the images so that the transition over the overlapped area becomes smooth enough.

Figures 15 and 16 show the seam removal results for the overlapped area using the wavelet transformation. The low-frequency bands are fused using the average of the two parts and the high-frequency bands using the maximum of the two parts. The seam is greatly reduced compared with Fig. 8, but it is still more visible than in Fig. 11.
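A minimal sketch of this fusion rule (Python with the PyWavelets package; the wavelet family, the decomposition depth and the max-by-magnitude rule for the high-frequency bands are our own choices) is given below.

```python
import numpy as np
import pywt

def wavelet_blend(overlap1, overlap2, wavelet='db2', level=3):
    """Blend two aligned greyscale overlap regions in the wavelet domain:
    average the low-frequency band, keep the coefficient with the larger
    magnitude in the high-frequency bands, then reconstruct."""
    c1 = pywt.wavedec2(overlap1.astype(np.float64), wavelet, level=level)
    c2 = pywt.wavedec2(overlap2.astype(np.float64), wavelet, level=level)
    fused = [(c1[0] + c2[0]) / 2.0]                    # low frequency: average
    for bands1, bands2 in zip(c1[1:], c2[1:]):
        fused.append(tuple(np.where(np.abs(a) >= np.abs(b), a, b)
                           for a, b in zip(bands1, bands2)))
    blended = pywt.waverec2(fused, wavelet)
    h, w = overlap1.shape
    return np.clip(blended[:h, :w], 0, 255)
```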

Fig. 15

Seam removal with Wavelet transformation (cise)

Fig. 16

Seam removal with Wavelet transformation (rail)

Besides the Fourier and wavelet transformations, the curvelet transformation can also be applied to remove the seam. As an extension of the wavelet, the curvelet is a non-adaptive technique for multi-scale object representation. It was proposed in 1999 and improved in 2002 [92]. Compared with the wavelet, a curve can be represented more smoothly by the curvelet [93]. The curvelet transformation has been widely applied in image processing, such as image denoising [94, 95], image enhancement [96] and image fusion [93, 97, 98].

When γ is a curve with continuous second derivatives, the curvelet transformation is shown in [99] to have the error bound

$$ \left\| {f - Q_{{\text{N}}}^{{\text{C}}} (f)} \right\|_{2}^{2} \le CN^{ - 2} \log^{\frac{1}{2}} N, $$
(30)

where \(f(x) = g(x)_{{l\{ x_{2} \le \gamma (x_{1} )\} }}\), and \(Q_{{\text{N}}}^{{\text{C}}}\) is the non-linear approximation of N items in the Curvelet transformation of f. The seam removal results with Curvelet transformation are shown in Figs. 17 and 18.

Fig. 17

Seam removal with Curvelet transformation (cise)

Fig. 18

Seam removal with Curvelet transformation (rail)

In [93], an image fusion technique based on the curvelet transformation is proposed; a local-energy-based fusion rule, which is more effective than single-pixel-based fusion rules, is applied. In [97], a curvelet-based approach to image fusion is proposed: the image is segmented into small tiles to expose detail information, and the ridgelet transformation is applied to each tile and used in the fusion process. In [98], the curvelet transformation is combined with a genetic algorithm for medical image fusion.

4 Quality assessment of image stitching

Assessing a stitched image plays a key role in improving the stitching algorithm and in judging whether the stitching result corresponds to human visual perception. Image quality assessment methods comprise subjective methods and objective methods [100]. In subjective assessment, experiments are needed in which testers rate the image, while in objective assessment, algorithms are applied to assess the quality of an image. The main algorithms in objective assessment are the PSNR (peak signal-to-noise ratio) [101] and SSIM (structural similarity index) methods [102, 103].

PSNR is an engineering term for the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. It is usually applied to measure the quality of reconstruction of lossy compression codecs (e.g., for image compression). PSNR is defined by mean squared error (MSE) as

$$ {\text{PSNR}} = 10\log_{10} \left( {\frac{{{\text{MAX}}_{I}^{2} }}{{{\text{MSE}}}}} \right), $$
(31)

where MAXI is the maximum possible pixel value of an image. When the pixels are represented using 8 bits per sample, it is 255. For an M × N sized image I and its noisy approximation K, MSE is defined as

$$ {\text{MSE}} = \frac{1}{MN}\sum\limits_{x = 0}^{M - 1} {\sum\limits_{y = 0}^{N - 1} {(I(x,y) - K(x,y))^{2} } } . $$
(32)

PSNR approximates human perception of reconstruction quality. When the reference image is identical to the image under assessment, the PSNR value is infinite. Typical PSNR values in lossy image compression are between 30 and 50 dB for a bit depth of eight bits, where higher is better; for 16-bit data, typical PSNR values are between 60 and 80 dB [104].

SSIM is a full-reference metric predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos. It is also applied to measure the quality of images. SSIM is calculated over various windows of an image. The measure between two windows x and y of size N × N is

$$ {\text{SSIM}}(x,y) = \frac{{(2\mu_{x} \mu_{y} + c_{1} )(2\sigma_{xy} + c_{2} )}}{{(\mu_{x}^{2} + \mu_{y}^{2} + c_{1} )(\sigma_{x}^{2} + \sigma_{y}^{2} + c_{2} )}}, $$
(33)

where μx and μy are the mean values of x and y, σx² and σy² are the variances of x and y, and σxy is the covariance of x and y. c1 is set to (k1L)² and c2 to (k2L)². L is the dynamic range of the pixel values (usually 2^8 − 1 = 255 for eight-bit images), and k1 = 0.01, k2 = 0.03. The SSIM value lies between 0 and 1, where a higher value means that the image under assessment has higher quality. When the reference image is identical to the image under assessment, the SSIM value is 1. Tables 2 and 3 are the assessment results using PSNR and SSIM, respectively.
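Before turning to the results, a minimal sketch of both metrics is given below (Python; PSNR is computed directly from Eqs. (31)–(32) with NumPy, while SSIM is taken from scikit-image, which is an assumed dependency).

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(reference, test, max_val=255.0):
    """PSNR per Eqs. (31)-(32); returns inf when the images are identical."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def ssim(reference, test):
    """Mean SSIM over local windows (Eq. (33)), assuming 8-bit images."""
    return structural_similarity(reference, test, data_range=255,
                                 channel_axis=-1 if reference.ndim == 3 else None)
```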

Table 2 Assessment of stitched images using PSNR
Table 3 Assessment of stitched images using SSIM

In the tables, cise denotes the image stitched from Fig. 2 and rail denotes the image stitched from Fig. 10. The images with seam removal by the optimal seam method [55] (Fig. 13 for cise, Fig. 14 for rail) are taken as the reference. original means the image stitched without seam removal (Fig. 8 for cise, Fig. 11 for rail). weighted means the image stitched with pixel weighting [78] for seam removal (Fig. 9 for cise, Fig. 12 for rail). wavelet means the image stitched with the wavelet transformation [90] for seam removal (Fig. 15 for cise, Fig. 16 for rail). curvelet means the image stitched with the curvelet transformation [93] for seam removal (Fig. 17 for cise, Fig. 18 for rail). All these images are registered by SIFT, the best-performing descriptor identified in Sect. 2.3.

PSNR is one criterion for image assessment. However, PSNR needs a reference image, which is sometimes difficult to obtain. In addition, its results do not always correspond to human vision, because human visual perception is affected by many factors such as spatial frequency and brightness. SSIM is calculated based on a perception model which takes the variation of the structural information of an image into consideration. It is good at measuring the information loss that occurs during image degradation, and it is easy to compute and applicable to various images. However, SSIM also needs a reference image.

In the assessment results, optimal is taken as the reference, so its PSNR is ∞ and its SSIM is 1. From the results, we can see that the PSNR and SSIM indices of curvelet are the highest among the compared methods, meaning its quality is the best. We can also see an obvious ghost artefact in rail-weighted (Fig. 12): in PSNR, its score decreases sharply compared with rail-original, while in SSIM, its score is almost the same as rail-original. This shows the importance of selecting a proper seam removal method, as an improper selection may be counterproductive.

An image assessment model based on deep learning is proposed in [105], where the assessment is converted into a 5-level classification task with levels Excellent, Good, Fair, Poor and Bad. In this method, P(Q|L) is the posterior probability based on the Bayes rule with L levels of decomposition and intrinsic quality Q. Given the input image representation X, the distribution of the intrinsic quality can be obtained by marginalization:

$$ P(Q|X) = \int {P(Q|L)P(L|X){\text{d}}L} $$
(34)

P(Q|X) is the quality distribution which represents the evaluations by a population. The numerical measurement of image quality is

$$ {\text{Quality}} = E(P(Q|X)) $$
(35)

where E(·) denotes the expected value.

Experimental results show that this method is effective, efficient and robust to small training sets, and its output corresponds well to human vision. However, parts of the method are hand-crafted, which requires time-consuming hand-tuning, and it fails to express certain eccentric distortions. This means that using deep learning to learn more powerful image representations for describing image quality remains a great challenge that has yet to be resolved.

A difference of edge map (DoEM) assessment method for stitched images is proposed in [106]. It computes the mean brightness of local edge difference maps and divides it by the overall mean value, and judges from this ratio whether the difference in brightness plays a major role. When generating marks, the ratio between the brightness-mutation and misplacement assessments adjusts dynamically with the variance of the difference map. DoEM is defined by the following equation:

$$ {\text{DoEM}} = e^{{ - \frac{{\sigma^{2} }}{{c_{4} }}}} \left( {\frac{{\mu_{{\text{e}}} e^{{ - \frac{{\mu_{{\text{e}}} }}{{c_{1} }}}} + \mu_{{\text{a}}} e^{{ - \frac{{\mu_{{\text{a}}} }}{{c_{2} }}}} }}{{\mu_{{\text{e}}} + \mu_{{\text{a}}} }}} \right) + \left( {1 + e^{{ - \frac{{\sigma^{2} }}{{c_{4} }}}} } \right)e^{{ - \frac{{\sigma^{2} }}{{c_{3} }}}} , $$
(36)

where μe is the mean value of the transition region in the edge difference map, μa is the overall mean value of the transition region, σ2 is the overall variance of the transition area, and C1, C2, C3 and C4 are four constants. C1 and C2 are determined according to the relativity of the mean value difference in the difference map, while C3 and C4 lie on the sub-Gaussian distribution curve. According to numerous experiments, the suggested values are C1 = 80, C2 = 50, C3 = 600 and C4 = 256. The final mark is normalized between 0 and 1; a higher mark means better image quality. Table 4 shows the assessment results with the DoEM method.
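Once the edge-difference-map statistics are available, Eq. (36) is straightforward to evaluate; the sketch below (Python/NumPy) performs only that last step, and how μe, μa and σ² are extracted from the edge difference maps is not shown here.

```python
import numpy as np

def doem_score(mu_e, mu_a, sigma2, c1=80.0, c2=50.0, c3=600.0, c4=256.0):
    """Evaluate the DoEM mark of Eq. (36), as printed, from the statistics
    mu_e (mean of the transition region in the edge difference map),
    mu_a (overall mean) and sigma2 (overall variance)."""
    w = np.exp(-sigma2 / c4)
    brightness = (mu_e * np.exp(-mu_e / c1) + mu_a * np.exp(-mu_a / c2)) / (mu_e + mu_a)
    return w * brightness + (1 + w) * np.exp(-sigma2 / c3)
```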

Table 4 Assessment of stitched images using DoEM

Experiments show that for cise, the curvelet-based method has the best score, while for rail, the original image without seam removal has the best score. We can also see that in rail-weighted, where a ghost artefact appears, the DoEM score decreases considerably. This means DoEM can accurately reflect the real quality of a stitched image and the performance of the stitching algorithm. However, this metric is not suitable for assessing the stitching sequence.

An approach to assessing stitched images is proposed in [107]. SSIM, SAM (spectral angle mapper) and IMR (intensity magnitude ratio) are applied to assess the geometric and photometric qualities of stitched images. Suppose p1 and p2 are two pixels from the two images and |p1| and |p2| denote their norms; SAM calculates the angle between them using the dot product formula:

$$ {\text{SAM}} = \arccos \frac{{p_{1} \cdot p_{2} }}{{\left| {p_{1} } \right|\left| {p_{2} } \right|}}. $$
(37)

If the angle between the two vectors is large, the SAM value will be large, indicating that the two vectors/image pixels are different.

For intensity quality assessment, IMR, defined as the ratio of magnitudes of two 3D color vectors, is used. If p1 and p2 are two pixels, |p1| and |p2| denote their norms, then IMR is defined as

$$ {\text{IMR}} = \frac{{\min \left( {\left| {p_{1} } \right|,\left| {p_{2} } \right|} \right)}}{{\max \left( {\left| {p_{1} } \right|,\left| {p_{2} } \right|} \right)}}. $$
(38)

The algorithmic procedure for IMR calculation is similar to that of the colour quality assessment with SAM, except that IMR is used instead of SAM. The threshold for a significant error in IMR is taken as 2%. This metric can effectively quantify the quality of stitched images; its results agree with qualitative analysis, and the method is extensible.
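A minimal sketch of both per-pixel measures (Python/NumPy; the function name, vectorized layout and small epsilon guard are our own assumptions) is given below.

```python
import numpy as np

def sam_and_imr(p1, p2, eps=1e-12):
    """Per-pixel SAM (Eq. (37)) and IMR (Eq. (38)) between two aligned
    colour images of shape (H, W, 3)."""
    v1 = p1.astype(np.float64).reshape(-1, 3)
    v2 = p2.astype(np.float64).reshape(-1, 3)
    n1 = np.linalg.norm(v1, axis=1)
    n2 = np.linalg.norm(v2, axis=1)
    cos = np.clip((v1 * v2).sum(axis=1) / (n1 * n2 + eps), -1.0, 1.0)
    sam = np.arccos(cos)                                     # Eq. (37), radians
    imr = np.minimum(n1, n2) / (np.maximum(n1, n2) + eps)    # Eq. (38)
    shape = p1.shape[:2]
    return sam.reshape(shape), imr.reshape(shape)
```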

In our experiments, we computed the pixel values and gradient values for 25 pixels on each side of the stitching seam. To show the results more clearly, the figures plot one randomly selected column in each image. Figure 19 is the assessment result of cise and Fig. 20 is the assessment result of rail. The three lines in each figure show the three channels (R, G and B) of the image: the solid line corresponds to the R channel, the dashed line to the G channel and the dotted line to the B channel.

Fig. 19

Assessment results of cise

Fig. 20

Assessment results of rail

We also calculate the variance of the gradient values in the columns selected in the figures; the variance indicates the degree of dispersion of the data (pixels). These data are shown in Tables 5 and 6. M is the average of the R, G and B values in a row, calculated by the following equation, where R, G and B are the gradient variances in the R, G and B channels, respectively. A larger variance means the data are more dispersed:

$$ M = \frac{R + G + B}{3}. $$
(39)
Table 5 Assessment of cise using gradient variance
Table 6 Assessment of rail using gradient variance

From Figs. 19a and 20a, we can see that there is a gradient difference between the two images, producing an obvious stitching seam. Hence, the goal of seam removal is to reduce such differences. When we use weighted, optimal or curvelet for seam removal, the gradient variance is reduced, meaning the seam tends to disappear. When we use wavelet, however, the gradient variance even increases. As an image can be abstracted and approximated by curves, this indicates that the representation of curves by the wavelet cannot meet the demand of seamless stitching.

Combining the results in the figures and tables, we can see that the optimal seam method is the better choice for seam removal in our case. However, no single method is suitable for all kinds of seam removal in image stitching, because many factors affect the attributes of the seam.

5 Discussion

Image stitching has been quickly developing in recent years. Many new methods related to image stitching have been proposed. However, there are still some points that can be improved in the future.

The first is image registration with a large number of images. The errors of matching and registering two images are amplified when we stitch many images into a whole panorama. Since such error is inevitable, further transformation of the images is needed to reduce it.

The second is image stitching with 3-dimensional images, which contain not only colour information but also depth information. The matching and registration methods need to be adjusted to take the depth dimension into consideration; using point clouds might be one solution.

The third is to develop image stitching for special purposes, including porting the algorithms to embedded systems such as cell phones, digital cameras, unmanned aircraft and tablets, and designing specific algorithms for geographical, biomedical and underwater images. This can be done by deeply investigating both the application field and the related algorithms.

With the increasing application of panorama images, new requirements are being placed on image stitching. It has become increasingly important to make the whole stitching process faster and easier, and stitched images are expected to become more precise. Stitching algorithms designed for specific applications, such as geographical, biomedical and underwater images, will appear. A mature algorithm should achieve a balance between time consumption and precision, producing a result that can be accepted by most audiences.

6 Conclusion

Image stitching relates to domains such as computer vision, image processing, computer graphics and software development. The effects of image registration and seam removal play key roles in successful stitching. In this article, methods for image registration and seam removal have been introduced and analysed, and image assessment methods have been briefly presented.

In recent years, more and more improved image stitching methods have appeared. Some algorithms register images quickly, and some combine multiple registration algorithms; many new seam removal methods are proposed at the same time. Unfortunately, there is no algorithm suitable for all stitching tasks, nor is it easy to select a specific method for a certain purpose. However, with the development of new techniques, image stitching will become more precise and find wider applications.