
1 Introduction

Rolling Shutter (RS) cameras adopt CMOS sensors due to their low cost and simplicity in manufacturing. This stands in contrast to Global Shutter (GS) CCD cameras that require specialized and highly dedicated fabrication. Such a discrepancy gives RS cameras a great advantage for ubiquitous employment in consumer products, e.g., smartphone cameras  [44] or dashboard cameras  [12]. However, the expediency in fabrication also causes a serious defect in image capture—instead of capturing different scanlines all at once as in GS cameras, RS cameras expose each scanline one by one sequentially from top to bottom. While a static RS camera capturing a static scene is free of artifacts, the RS effect comes to haunt us as soon as images are taken during motion, i.e., images could be severely distorted due to scanline-varying camera poses (see Fig. 1).

RS distortion has been rearing its ugly head in various computer vision tasks. There is constant pressure to either remove the RS distortion in the front-end image capture  [25, 48, 50, 61], or design task-dependent RS-aware algorithms in the back end  [2, 10, 15, 42, 46, 51, 54]. While various algorithms have been developed for each of them in isolation, algorithms achieving both in a holistic way are few  [24, 52, 62, 66]. In this paper, we make contributions towards further advancement in this line. Specifically, we propose a novel differential homography and demonstrate its application to carry out RS image stitching and rectification at one stroke.

Fig. 1. Example results of RS-aware image stitching and rectification.

The RS effect complicates the two-view geometry significantly compared to its GS counterpart, primarily because 12 additional unknown parameters are required to model the intra-frame velocity of the two cameras. Thus, despite the recent effort of Lao et al.  [24] in solving a generic RS homography for discrete motion, the complexity of RS geometry significantly increases the number of required correspondences (36 points for the full model and 13.5 points after a series of approximations). Inspired by prior work  [66] that demonstrates dramatic simplification in differential RS relative pose estimation compared to its discrete counterpart  [10], we focus in this paper on the special yet common case where the inputs are two consecutive frames from a video. In this case, the inter-frame motion is restricted from being arbitrarily large, allowing us to adopt the simpler differential homography model  [39]. Furthermore, the intra-frame motion could be directly parameterized by the inter-frame motion via interpolation, thereby reducing the total number of unknown parameters to solve. In particular, we derive an RS-aware differential homography under the constant acceleration motion assumption, together with a straightforward solver requiring only 5 pairs of correspondences, and demonstrate its application to simultaneous RS image stitching and rectification. Since a single homography warping is only exact under pure rotational camera motion or for a planar 3D scene, it often causes misalignment when this condition is not strictly met in practice. To address this model inadequacy, we extend the single RS homography model to a spatially-varying RS homography field following the As-Projective-As-Possible (APAP) principle  [63], thereby lending itself to handling complex scenes. We demonstrate example results in Fig. 1, where multiple images are stitched and rectified by concatenating pairwise warping from our method.

We would also like to emphasize our advantage over the differential Structure-from-Motion (SfM)-based rectification method  [66]. Note that [66] computes the rectification for each pixel separately via pixel-wise depth estimation from optical flow and camera pose. As such, potential gross errors in optical flow estimates could lead to severe artifacts in the texture-less or non-overlapping regions. In contrast, the more parsimonious homography model offers a natural defense against wrong correspondences. Despite its lack of full 3D reconstruction, we observe good empirical performance in terms of visual appearance.

In summary, our contributions include:

  • We derive a novel differential homography model together with a minimal solver to account for the scanline-varying camera poses of RS cameras.

  • We propose an RS-aware spatially-varying homography field for improving RS image stitching.

  • Our proposed framework outperforms state-of-the-art methods both in RS image rectification and stitching.

2 Related Work

RS Geometry. Since the pioneering work of Meingast et al.  [41], considerable efforts have been invested in studying the geometry of RS cameras. These include relative pose estimation  [10, 47, 66], absolute pose estimation  [2, 3, 23, 26, 40, 56], bundle adjustment  [15, 22], SfM/Reconstruction  [20, 54, 55, 58], degeneracies  [4, 21, 68], discrete homography  [24], and others  [5, 45]. In this work, we introduce RS-aware differential homography, which is of only slightly higher complexity than its GS counterpart.

RS Image Rectification. Removing RS artifacts using a single input image is inherently an ill-posed problem. Works in this line  [25, 48, 50] often assume simplified camera motions and scene structures, and require line/curve detection in the image, if available at all. Recent methods  [49, 68] have started exploring deep learning for this task. However, their generalization ability to different scenes remains an open problem. In contrast, multi-view approaches, be it geometric-based or learning-based  [35], are more geometrically grounded. In particular, Ringaby and Forssen  [52] estimate and smooth a sequence of camera rotations for eliminating RS distortions, while Grundmann et al.  [11] and Vasu et al.  [61] use a mixture of homographies to model and remove RS effects. Such methods often rely on nontrivial iterative optimization leveraging a large set of correspondences. Recently, Zhuang et al.  [66] present the first attempt to derive a minimal solver for RS rectification. It takes a minimal set of points as input and lends itself well to RANSAC, leading to a more principled way for robust estimation. In the same spirit, we derive an RS-aware differential homography and show important advantages. Note that our minimal solver is orthogonal to the optimization-based methods, e.g.  [52, 61], and can serve as their initialization. Very recently, Albl et al.  [1] present an interesting way for RS undistortion from two cameras, yet it requires a specific camera mounting.

GS Image Stitching. Image stitching  [59] has achieved significant progress over the past few decades. Theoretically, a single homography is sufficient to align two input images of a common scene if the images are captured with no parallax or the scene is planar  [13]. In practice, this condition is often violated, causing misalignments or ghosting artifacts in the stitched images. To overcome this issue, several approaches have been proposed such as spatially-varying warps  [27, 29, 33, 34, 63], shape-preserving warps  [7, 8, 30], and seam-driven methods  [17, 18, 31, 65]. All of the above approaches assume a GS camera model and hence they cannot handle RS images, i.e., the stitched images may contain RS distortion-induced misalignment. While Lao et al.  [24] demonstrate the possibility of stitching in spite of RS distortion, we present a more concise and straightforward method that works robustly with hand-held cameras.

3 Homography Preliminary

GS Discrete Homography. Let us assume that two calibrated cameras are observing a 3D plane parameterized as \((\textit{\textbf{n}},d)\), with \(\textit{\textbf{n}}\) denoting the plane normal and d the camera-to-plane distance. Denoting the relative camera rotation and translation as \(\textit{\textbf{R}} \in SO(3)\) and \(\textit{\textbf{t}}\in \mathbb {R}^3\), a pair of 2D correspondences \(\textit{\textbf{x}}_1\) and \(\textit{\textbf{x}}_2\) (on the normalized image plane) can be related by \(\hat{\textit{\textbf{x}}}_2 \propto \textit{\textbf{H}}\hat{\textit{\textbf{x}}}_1\), where \(\textit{\textbf{H}} = \textit{\textbf{R}}+{\textit{\textbf{tn}}^\top }/{d}\) is defined as the discrete homography  [13] and \(\hat{\textit{\textbf{x}}}=[\textit{\textbf{x}}^\top ,1]^\top \). \(\propto \) indicates equality up to a scale. Note that \(\textit{\textbf{H}}\) in the above format subsumes the pure rotation-induced homography as a special case by letting \(d\rightarrow \infty \). Each pair of correspondences \(\{\textit{\textbf{x}}_1^i,\textit{\textbf{x}}_2^i\}\) gives two constraints \(\textit{\textbf{a}}_i\textit{\textbf{h}}=\textit{\textbf{0}}\), where \(\textit{\textbf{h}} \in \mathbb {R}^{9}\) is the vectorized form of \(\textit{\textbf{H}}\) and the coefficients \(\textit{\textbf{a}}_i\in \mathbb {R}^{2\times 9}\) can be computed from \(\{\textit{\textbf{x}}_1^i,\textit{\textbf{x}}_2^i\}\). In the GS discrete 4-point solver, with a minimum of 4 points, one can solve \(\textit{\textbf{h}}\) via:

$$\begin{aligned} \textit{\textbf{Ah}}=\textit{\textbf{0}},~~~s.t.~ \Vert \textit{\textbf{h}}\Vert =1, \end{aligned}$$
(1)

which has a closed-form solution by Singular Value Decomposition (SVD). \(\textit{\textbf{A}}\) is obtained by stacking all \(\textit{\textbf{a}}_i\).
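To make the 4-point DLT concrete, the following is a minimal numerical sketch (our own, not the authors' implementation; the function name and NumPy usage are illustrative choices):

```python
import numpy as np

def gs_discrete_homography(x1, x2):
    """DLT estimate of the discrete homography from >= 4 correspondences.

    x1, x2: (N, 2) arrays of matching points; returns H (3x3) such that
    x2_hat ~ H @ x1_hat up to scale (Eq. 1, solved via SVD).
    """
    rows = []
    for (u, v), (up, vp) in zip(x1, x2):
        xh = [u, v, 1.0]
        # Two constraints a_i h = 0 per correspondence (cross-product form).
        rows.append([0.0] * 3 + [-c for c in xh] + [vp * c for c in xh])
        rows.append(list(xh) + [0.0] * 3 + [-up * c for c in xh])
    A = np.asarray(rows)
    # min ||A h|| s.t. ||h|| = 1: right singular vector of the smallest
    # singular value of the stacked coefficient matrix A.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)
```

With noise-free correspondences generated from a ground-truth homography, the estimate matches it exactly up to the global scale.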

GS Spatially-Varying Discrete Homography Field. In the image stitching application, it is often safe to make the zero-parallax assumption as long as the (non-planar) scene is far enough away. However, it is also not uncommon that this assumption is violated to the extent that warping with just one global homography causes unpleasant misalignments. To address this issue, APAP  [63] proposes to compute a spatially-varying homography field for each pixel \(\textit{\textbf{x}}\):

$$\begin{aligned} \textit{\textbf{h}}^*(\textit{\textbf{x}}) = \arg \min _{\textit{\textbf{h}}} \sum _{i\in \mathcal {I}} \Vert w_i(\textit{\textbf{x}})\textit{\textbf{a}}_i\textit{\textbf{h}}\Vert ^2,~~~s.t.~ \Vert \textit{\textbf{h}}\Vert =1, \end{aligned}$$
(2)

where \(w_i(\textit{\textbf{x}}) = \max (\exp (-\frac{\Vert \textit{\textbf{x}}-\textit{\textbf{x}}_i\Vert ^2}{\sigma ^2}),\tau )\) is a weight. \(\sigma \) and \(\tau \) are the pre-defined scale and regularization parameters respectively. \(\mathcal {I}\) indicates the inlier set returned from the GS discrete 4-point solver with RANSAC (motivated by  [60]). The optimization has a closed-form solution by SVD. On the one hand, Eq. 2 encourages the warping to be globally As-Projective-As-Possible (APAP) by making use of all the inlier correspondences, while, on the other hand, it allows local deformations guided by nearby correspondences to compensate for model deficiency. Despite being a simple tweak, it nevertheless leads to considerable improvement in image stitching.
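A sketch of Eq. 2 in the same spirit (function name and the default sigma/tau values are our illustrative choices, not the paper's tuned settings): per query pixel, the DLT rows are re-weighted and the weighted system is solved by SVD.

```python
import numpy as np

def apap_field(x1, x2, query_pts, sigma=0.5, tau=0.01):
    """Spatially-varying homography field (Eq. 2): one weighted DLT per pixel.

    x1, x2: (N, 2) inlier correspondences; query_pts: (M, 2) pixels to warp.
    Returns an (M, 3, 3) array of local homographies.
    """
    A, anchors = [], []
    for (u, v), (up, vp) in zip(x1, x2):
        xh = [u, v, 1.0]
        A.append([0.0] * 3 + [-c for c in xh] + [vp * c for c in xh])
        A.append(list(xh) + [0.0] * 3 + [-up * c for c in xh])
        anchors += [(u, v)] * 2          # each correspondence owns two rows
    A, anchors = np.asarray(A), np.asarray(anchors, float)

    Hs = []
    for q in np.asarray(query_pts, float):
        d2 = np.sum((anchors - q) ** 2, axis=1)
        w = np.maximum(np.exp(-d2 / sigma ** 2), tau)   # w_i(x) from Eq. 2
        # min ||diag(w) A h|| s.t. ||h|| = 1, again closed form via SVD
        _, _, Vt = np.linalg.svd(w[:, None] * A)
        Hs.append(Vt[-1].reshape(3, 3))
    return np.stack(Hs)
```

When all correspondences are exactly explained by one homography, every local estimate coincides with the global one; local deviations appear only when the data demand them.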

GS Differential Homography. Suppose the camera is undergoing an instantaneous motion  [19], consisting of rotational and translational velocity \((\varvec{\omega },\textit{\textbf{v}})\), which induces a motion flow \(\textit{\textbf{u}}\in \mathbb {R}^2\) at each image point \(\textit{\textbf{x}}\). Denoting \(\tilde{\textit{\textbf{u}}}=[\textit{\textbf{u}}^\top , 0]^\top \), we have

$$\begin{aligned} \tilde{\textit{\textbf{u}}} = (\textit{\textbf{I}}-\hat{\textit{\textbf{x}}}\textit{\textbf{e}}_3^\top )\textit{\textbf{H}}\hat{\textit{\textbf{x}}}, \end{aligned}$$
(3)

where \(\textit{\textbf{H}}= -(\lfloor \varvec{\omega } \rfloor _\times + {\textit{\textbf{vn}}^\top }/{d})\) is defined as the differential homography  [39]. \(\textit{\textbf{I}}\) represents the identity matrix and \(\textit{\textbf{e}}_3 = [0,0,1]^\top \). \(\lfloor .\rfloor _\times \) returns the corresponding skew-symmetric matrix from the vector. Each flow estimate \(\{\textit{\textbf{u}}_i,\textit{\textbf{x}}_i\}\) gives two effective constraints out of the three equations in Eq. 3, denoted as \(\textit{\textbf{b}}_i\textit{\textbf{h}}=\textit{\textbf{u}}_i\), where \(\textit{\textbf{b}}_i\in \mathbb {R}^{2\times 9}\) can be computed from \(\textit{\textbf{x}}_i\). In the GS differential 4-point solver, with a minimum of 4 flow estimates, \(\textit{\textbf{H}}\) can be computed by solving:

$$\begin{aligned} \textit{\textbf{Bh}}=\textit{\textbf{U}}, \end{aligned}$$
(4)

which admits a closed-form solution by pseudo-inverse. \(\textit{\textbf{B}}\) and \(\textit{\textbf{U}}\) are obtained by stacking all \(\textit{\textbf{b}}_i\) and \(\textit{\textbf{u}}_i\), respectively. Note that we can only recover \(\textit{\textbf{H}}_L = \textit{\textbf{H}}+\varepsilon \textit{\textbf{I}}\) with an unknown scale \(\varepsilon \), because \(\textit{\textbf{B}}\) has a one-dimensional null space. One can easily see this by replacing \(\textit{\textbf{H}}\) in Eq. 3 with \(\varepsilon \textit{\textbf{I}}\) and observing that the right hand side vanishes, regardless of the value of \(\textit{\textbf{x}}\). \(\varepsilon \) can be determined subsequently by utilizing the special structure of calibrated \(\textit{\textbf{H}}\). However, this is not relevant in our paper since we focus on image stitching with general uncalibrated images.
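A minimal sketch of the differential 4-point solve (ours; names are illustrative), which also makes the vec(I) null direction discussed above easy to verify numerically:

```python
import numpy as np

def flow_rows(x):
    """b_i in Eq. 3: two rows such that b_i @ vec(H) equals the flow u at x."""
    px, py = x
    xh = np.array([px, py, 1.0])
    b = np.zeros((2, 9))
    b[0, 0:3], b[0, 6:9] = xh, -px * xh
    b[1, 3:6], b[1, 6:9] = xh, -py * xh
    return b

def gs_differential_homography(xs, us):
    """Least-squares solution of B h = U (Eq. 4) from >= 4 flow samples.

    Because B has the one-dimensional null space spanned by vec(I), this
    returns some H_L = H + eps * I; any representative predicts the same flow.
    """
    B = np.vstack([flow_rows(x) for x in xs])
    U = np.concatenate([np.asarray(u, float) for u in us])
    h, *_ = np.linalg.lstsq(B, U, rcond=None)
    return h.reshape(3, 3)
```

The minimum-norm solution returned by `lstsq` picks one representative of the eps-family; the residual against the ground truth is a multiple of the identity, exactly as the text states.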

4 Methods

4.1 RS Motion Parameterization

Under the discrete motion model, in addition to the 6-Degree of Freedom (DoF) inter-frame relative motion \((\textit{\textbf{R}},\textit{\textbf{t}})\), 12 additional unknown parameters \((\varvec{\omega }_1,\textit{\textbf{v}}_1)\) and \((\varvec{\omega }_2,\textit{\textbf{v}}_2)\) are needed to model the intra-frame camera velocity, as illustrated in Fig. 2(a). This quickly increases the minimal number of points and the algorithm complexity to compute an RS-aware homography. Instead, we aim to solve for the case of continuous motion, i.e., a relatively small motion between two consecutive frames. In this case, we only need to parameterize the relative motion \((\varvec{\omega },\textit{\textbf{v}})\) between the two first scanlines (one can choose other reference scanlines without loss of generality) of the image pair, and the poses corresponding to all the other scanlines can be obtained by interpolation, as illustrated in Fig. 2(b). In particular, it is shown in [66] that a quadratic interpolation can be derived under constant acceleration motion. Formally, the absolute camera rotation and translation (\(\textit{\textbf{r}}_1^{y_1}\), \(\textit{\textbf{p}}_1^{y_1}\)) (resp. (\(\textit{\textbf{r}}_2^{y_2}\), \(\textit{\textbf{p}}_2^{y_2}\))) of scanline \(y_1\) (resp. \(y_2\)) in frame 1 (resp. 2) can be written as:

$$\begin{aligned} \textit{\textbf{r}}_1^{y_1}&= \beta _1(k,y_1)\varvec{\omega },~~~\textit{\textbf{p}}_1^{y_1} = \beta _1(k,y_1)\textit{\textbf{v}},\end{aligned}$$
(5)
$$\begin{aligned} \textit{\textbf{r}}_2^{y_2}&= \beta _2(k,y_2)\varvec{\omega },~~~\textit{\textbf{p}}_2^{y_2} = \beta _2(k,y_2)\textit{\textbf{v}}, \end{aligned}$$
(6)

where

$$\begin{aligned} \beta _1(k,y_1)&= (\frac{\gamma y_1}{h}+\frac{1}{2}k(\frac{\gamma y_1}{h})^2)(\frac{2}{2+k}),\end{aligned}$$
(7)
$$\begin{aligned} \beta _2(k,y_2)&= (1+\frac{\gamma y_2}{h}+\frac{1}{2}k(1+\frac{\gamma y_2}{h})^2)(\frac{2}{2+k}). \end{aligned}$$
(8)

Here, k is an extra unknown motion parameter describing the acceleration, which is assumed to be in the same direction as the velocity. \(\gamma \) denotes the readout time ratio  [66], i.e. the ratio between the time for scanline readout and the total time between two frames (including inter-frame delay). h denotes the total number of scanlines in an image. Note that the absolute poses (\(\textit{\textbf{r}}_1^{y_1}\), \(\textit{\textbf{p}}_1^{y_1}\)) and (\(\textit{\textbf{r}}_2^{y_2}\), \(\textit{\textbf{p}}_2^{y_2}\)) are all defined w.r.t. the first scanline of frame 1. It follows that the relative pose between scanlines \(y_1\) and \(y_2\) reads:

$$\begin{aligned} \varvec{\omega }_{y_1y_2}&= \textit{\textbf{r}}_2^{y_2} - \textit{\textbf{r}}_1^{y_1} = (\beta _2(k,y_2)-\beta _1(k,y_1))\varvec{\omega },\end{aligned}$$
(9)
$$\begin{aligned} \textit{\textbf{v}}_{y_1y_2}&= \textit{\textbf{p}}_2^{y_2} - \textit{\textbf{p}}_1^{y_1} = (\beta _2(k,y_2)-\beta _1(k,y_1))\textit{\textbf{v}}. \end{aligned}$$
(10)

We refer the readers to [66] for the detailed derivation of the above equations.
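For concreteness, Eqs. 7 and 8 share one functional form in the normalized time s, which can be implemented directly (function names and the default gamma, h values are our illustrative choices):

```python
def beta_interp(k, s):
    """Constant-acceleration pose-interpolation weight (Eqs. 5-8).

    s is normalized time: s = gamma*y1/h for scanline y1 of frame 1 (Eq. 7)
    and s = 1 + gamma*y2/h for scanline y2 of frame 2 (Eq. 8).
    """
    return (s + 0.5 * k * s ** 2) * (2.0 / (2.0 + k))

def beta1(k, y1, gamma=1.0, h=720):
    return beta_interp(k, gamma * y1 / h)

def beta2(k, y2, gamma=1.0, h=720):
    return beta_interp(k, 1.0 + gamma * y2 / h)
```

Sanity checks follow the text: k = 0 reduces to a constant-velocity (linear) interpolation, the first scanline of frame 1 is the reference pose (weight 0), and the first scanline of frame 2 sits at normalized time 1.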

Fig. 2. Illustration of discrete/continuous camera motion and their motion parameters.

4.2 RS-Aware Differential Homography

We are now in a position to derive the RS-aware differential homography. First, it is easy to verify that Eq. 3 also applies to uncalibrated cameras, in which case \(\textit{\textbf{H}} = -\textit{\textbf{K}}(\lfloor \varvec{\omega } \rfloor _\times + \textit{\textbf{vn}}^\top /d)\textit{\textbf{K}}^{-1}\), with \(\textit{\textbf{u}}\) and \(\textit{\textbf{x}}\) being raw measurements in pixels. \(\textit{\textbf{K}}\) denotes the unknown camera intrinsic matrix. Given a correspondence pair \(\{\textit{\textbf{u}},\textit{\textbf{x}}\}\), we can plug \((\varvec{\omega }_{y_1y_2},\textit{\textbf{v}}_{y_1y_2})\) into Eq. 3, yielding

$$\begin{aligned} \tilde{\textit{\textbf{u}}}&= (\beta _2(k,y_2) - \beta _1(k,y_1))(\textit{\textbf{I}}-\hat{\textit{\textbf{x}}}\textit{\textbf{e}}_3^\top )\textit{\textbf{H}}\hat{\textit{\textbf{x}}} = \beta (k,y_1,y_2)(\textit{\textbf{I}}-\hat{\textit{\textbf{x}}}\textit{\textbf{e}}_3^\top )\textit{\textbf{H}}\hat{\textit{\textbf{x}}}. \end{aligned}$$
(11)

Here, we can define \(\textit{\textbf{H}}_{RS} = \beta (k,y_1,y_2)\textit{\textbf{H}}\) as the RS-aware differential homography, which is now scanline dependent.

5-Point Solver. In addition to \(\textit{\textbf{H}}\), we now have one more unknown parameter k to solve. Below, we show that 5 pairs of correspondences are enough to solve for k and \(\textit{\textbf{H}}\), using the so-called hidden variable technique  [9]. To get started, let us first rewrite Eq. 11 as:

$$\begin{aligned} \beta (k,y_1,y_2)\textit{\textbf{b}}\textit{\textbf{h}} = \textit{\textbf{u}}. \end{aligned}$$
(12)

Next, we move \(\textit{\textbf{u}}\) to the left hand side and stack the constraints from 5 points, leading to:

$$\begin{aligned} \textit{\textbf{C}}\hat{\textit{\textbf{h}}} = \textit{\textbf{0}}, \end{aligned}$$
(13)

where

$$\begin{aligned} \textit{\textbf{C}} = \begin{bmatrix} \beta (k,y_1^1,y_2^1)\textit{\textbf{b}}_1, &{}~ -\textit{\textbf{u}}_1 \\ \beta (k,y_1^2,y_2^2)\textit{\textbf{b}}_2, &{}~ -\textit{\textbf{u}}_2 \\ \beta (k,y_1^3,y_2^3)\textit{\textbf{b}}_3, &{}~ -\textit{\textbf{u}}_3 \\ \beta (k,y_1^4,y_2^4)\textit{\textbf{b}}_4, &{}~ -\textit{\textbf{u}}_4 \\ \beta (k,y_1^5,y_2^5)\textit{\textbf{b}}_5, &{}~ -\textit{\textbf{u}}_5 \\ \end{bmatrix}, ~~~~~\hat{\textit{\textbf{h}}} = [\textit{\textbf{h}}^\top , ~1]^\top . \end{aligned}$$
(14)

It is now clear that, for \(\textit{\textbf{h}}\) to have a solution, \(\textit{\textbf{C}}\) must be rank-deficient. Further observing that \(\textit{\textbf{C}} \in \mathbb {R}^{10\times 10}\) is a square matrix, rank deficiency implies a vanishing determinant, i.e.,

$$\begin{aligned} \det (\textit{\textbf{C}})=0. \end{aligned}$$
(15)

This gives a univariate polynomial equation in k, whereby we can solve for k efficiently. \(\textit{\textbf{h}}\) can subsequently be extracted from the null space of \(\textit{\textbf{C}}\).

DoF Analysis. In fact, only 4.5 points are required in the minimal case, since we have one extra unknown k while each point gives two constraints. Utilizing 5 points nevertheless leads to a straightforward solution as shown. Yet, does this lead to an over-constrained system? No. Recall that we can only recover \(\textit{\textbf{H}}+\varepsilon \textit{\textbf{I}}\) up to an arbitrary \(\varepsilon \). Here, due to the one extra constraint, a specific value is chosen for \(\varepsilon \) since the last element of \(\hat{\textit{\textbf{h}}}\) is set to 1. Note that a true \(\varepsilon \), thus \(\textit{\textbf{H}}\), is not required in our context since it does not affect the warping. This is in analogy to uncalibrated SfM  [13] where a projective reconstruction up to an arbitrary projective transformation is not inferior to the Euclidean reconstruction in terms of reprojection error.

Plane Parameters. Strictly speaking, the plane parameters slightly vary as well due to the intra-frame motion. This is however not explicitly modeled in Eq. 11, for two reasons. First, although the intra-frame motion is in a similar range as the inter-frame motion (Fig. 2(b)) and hence has a large impact in terms of motion, it induces merely a small perturbation to the absolute value of the scene parameters, which can be safely ignored (see supplementary for a more formal characterization). Second, we would like to keep the solver as simple as possible as long as good empirical results are obtained (see Sect. 5).

Motion Infidelity vs. Shutter Fidelity. Note that the differential motion model is inherently an approximation designed for small motion. This means that, unlike its discrete counterpart, its fidelity decreases with increasing motion. Yet, we are only interested in relatively large motion, for which the RS distortion reaches a visually unpleasant level. Therefore, a natural and scientifically interesting question to ask is whether the benefits from modeling RS distortion (Shutter Fidelity) are more than enough to compensate for the sacrifices due to the approximation in the motion model (Motion Infidelity). Although a theoretical characterization of such a comparison is beyond the scope of this paper, via extensive experiments in Sect. 5, we fortunately observe that the differential RS model achieves overwhelming dominance in this competition.

Degeneracy. Are there different pairs of k and \(\textit{\textbf{H}}\) that lead to the same flow field \(\textit{\textbf{u}}\)? Although such degeneracy does not affect stitching, it does make a difference to rectification (Sect. 4.4). We leave the detailed discussion to the supplementary, but would like the readers to be assured that such cases are very rare, in accordance with Horn  [19] that motion flow is hardly ambiguous.

More Details. Firstly, note that although \(\{\textit{\textbf{u}},\textit{\textbf{x}}\}\) was typically collected from optical flow in classical works  [16, 38] predating keypoint descriptors (e.g., [37, 53]), we choose the latter for image stitching for higher efficiency. Secondly, if we fix \(k=0\), i.e., the constant velocity model, \((\varvec{\omega },\textit{\textbf{v}})\) could be solved with a linear 4-point minimal solver similar to the GS case. However, we empirically find its performance inferior to the constant acceleration model under shaking cameras, so we do not discuss it further.

4.3 RS-Aware Spatially-Varying Differential Homography Field

Can GS APAP  [63] Handle RS Distortion by Itself? As aforementioned, the adaptive weight in APAP (Eq. 2) permits local deformations to account for the local discrepancy from the global model. However, we argue that APAP alone is still not capable of handling RS distortion. The root cause lies in the GS homography being used—although the warping of pixels near correspondences is less affected, owing to the anchor-point role of correspondences, the warping of all other pixels still relies on the transformation propagated from the correspondences, and thus the model being used does matter.

RS-Aware APAP. Having obtained a set of inlier correspondences \(\mathcal {I}\) from our RS differential 5-point solver with RANSAC, we formulate the spatially-varying RS-aware homography field as:

$$\begin{aligned} \textit{\textbf{h}}^*(\textit{\textbf{x}}) = \arg \min _{\textit{\textbf{h}}} \sum _{i\in \mathcal {I}} \Vert w_i(\textit{\textbf{x}})(\beta (k,y_1,y_2)\textit{\textbf{b}}_i\textit{\textbf{h}} - \textit{\textbf{u}}_i)\Vert ^2, \end{aligned}$$
(16)

where \(w_i(\textit{\textbf{x}})\) is defined in Sect. 3. Since k is a pure motion parameter independent of the scene, we keep it fixed at this stage for simplicity. The normalization strategy  [14] is applied to \((\textit{\textbf{u}},\textit{\textbf{x}})\) for numerical stability. We highlight that the optimization has a simple closed-form solution, yet is geometrically meaningful in the sense that it minimizes the error between the estimated and the observed flow \(\textit{\textbf{u}}\). This stands in contrast to the discrete homography, for which minimizing the reprojection error requires nonlinear iterative optimization. In addition, we also observe higher stability from the differential model when keypoints concentrate in a small region (see supplementary for discussions).

4.4 RS Image Stitching and Rectification

Once we have the homography \(\textit{\textbf{H}}\) (either a global one or a spatially-varying field) mapping from frame 1 to frame 2, we can warp between two images for stitching. Referring to Fig. 2(b) and Eq. 11, for each pixel \(\textit{\textbf{x}}_1=[x_1, y_1]^\top \) in frame 1, we find its mapping \(\textit{\textbf{x}}_2=[x_2,y_2]^\top \) in frame 2 by first solving for \(y_2\) as:

$$\begin{aligned} y_2 = y_1 + \lfloor (\beta _2(k,y_2)-\beta _1(k,y_1))(\textit{\textbf{I}}-\hat{\textit{\textbf{x}}}_1\textit{\textbf{e}}_3^\top )\textit{\textbf{H}}\hat{\textit{\textbf{x}}}_1 \rfloor _y, \end{aligned}$$
(17)

which admits a closed-form solution. \(\lfloor .\rfloor _y\) indicates taking the y coordinate. \(x_2\) can then be obtained easily with known \(y_2\). Similarly, \(\textit{\textbf{x}}_1\) could also be projected to the GS canvas defined by the pose corresponding to the first scanline of frame 1, yielding its rectified point \(\textit{\textbf{x}}_{g1}\). \(\textit{\textbf{x}}_{g1}\) can be solved according to

$$\begin{aligned} \textit{\textbf{x}}_1 = \textit{\textbf{x}}_{g1} +\lfloor \beta _1(k,y_1)(\textit{\textbf{I}}-\hat{\textit{\textbf{x}}}_{g1}\textit{\textbf{e}}_3^\top )\textit{\textbf{H}}\hat{\textit{\textbf{x}}}_{g1}\rfloor _{xy}, \end{aligned}$$
(18)

where \(\lfloor .\rfloor _{xy}\) indicates taking x and y coordinate.

5 Experiments

5.1 Synthetic Data

Data Generation. First, we generate motion parameters \((\varvec{\omega },\textit{\textbf{v}})\) and k with desired constraints. For each scanline \(y_1\) (resp. \(y_2\)) in frame 1 (resp. 2), we obtain its absolute pose as \((R(\beta _1(k,y_1)\varvec{\omega }),\beta _1(k,y_1)\textit{\textbf{v}})\) (resp. \((R(\beta _2(k,y_2)\varvec{\omega }),\beta _2(k,y_2)\textit{\textbf{v}})\)). Here, \(R(\varvec{\theta }) = \exp (\varvec{\lfloor \theta \rfloor _\times })\) with \(\exp \): so(3) \(\xrightarrow {}\) SO(3). Due to the inherent depth-translation scale ambiguity, the magnitude of \(\textit{\textbf{v}}\) is defined as the ratio between the translation magnitude and the average scene depth. The synthesized image plane is of size 720 \(\times \,1280\) with a \(60^\circ \) horizontal Field Of View (FOV). Next, we randomly generate a 3D plane, on which we sample 100 3D points within FOV. Finally, we project each 3D point \(\textit{\textbf{X}}\) to the RS image. Since we do not know which scanline observes \(\textit{\textbf{X}}\), we first solve for \(y_1\) from the quadratic equation:

$$\begin{aligned} y_1 = \lfloor \pi (\textit{\textbf{R}}(\beta _1(k,y_1)\varvec{\omega })(\textit{\textbf{X}}-\beta _1(k,y_1)\textit{\textbf{v}})) \rfloor _y, \end{aligned}$$
(19)

where \(\pi ([a,b,c]^\top ){=}[a/c,b/c]^\top \). \(x_1\) can then be obtained easily with known \(y_1\). Likewise, we obtain the projection in frame 2.

Comparison Under Various Configurations. First, we study the performance in the noise-free case to understand the intrinsic, noise-independent behavior of different solvers, including the discrete GS 4-point solver (‘GS-disc’), the differential GS 4-point solver (‘GS-diff’), and our RS 5-point solver (‘RS-ConstAcc’). Specifically, we test the performance with varying RS readout time ratio \(\gamma \), rotation magnitude \(\Vert \varvec{\omega }\Vert \), and translation magnitude \(\Vert \textit{\textbf{v}}\Vert \). To get started, we first fix \((\Vert \varvec{\omega }\Vert ,\Vert \textit{\textbf{v}}\Vert )\) to \((3^\circ ,0.03)\), and increase \(\gamma \) from 0 to 1, indicating zero to strongest RS effect. Then, we fix \(\gamma =1, \Vert \textit{\textbf{v}}\Vert =0.03\) while increasing \(\Vert \varvec{\omega }\Vert \) from \(0^\circ \) to \(9^\circ \). Finally, we fix \(\gamma =1,\Vert \varvec{\omega }\Vert =3^\circ \) while increasing \(\Vert \textit{\textbf{v}}\Vert \) from 0 to 0.1. We report averaged reprojection errors over all point pairs in Fig. 3(a)–(c). The curves are averaged over 100 configurations with random plane and directions of \(\varvec{\omega }\) and \(\textit{\textbf{v}}\).

First, we observe that ‘GS-diff’ generally underperforms ‘GS-disc’, as expected from its approximate nature (cf. ‘Motion Infidelity’ in Sect. 4.2). In (a), although ‘RS-ConstAcc’ performs slightly worse than ‘GS-disc’ under small RS effect (\(\gamma \le 0.1\)), it quickly surpasses ‘GS-disc’ significantly with increasing \(\gamma \) (cf. ‘Shutter Fidelity’ in Sect. 4.2). Moreover, this holds consistently in (b) and (c), with the gap widening as the motion magnitude increases. Such observations suggest that the gain from handling the RS effect overwhelms the degradation brought about by the less faithful differential motion model. Further, we investigate noisy data by adding Gaussian noise (with standard deviations \(\sigma _g =1\) and \(\sigma _g =2\) pixels) to the projected 2D points. The updated results in the above three settings are shown in Fig. 3(d)–(f) and Fig. 3(g)–(i) for \(\sigma _g =1\) and \(\sigma _g =2\), respectively. Again, we observe considerable superiority of the RS-aware model, demonstrating its robustness against noise. We also conduct evaluation under different values of k, with \((\Vert \varvec{\omega }\Vert ,\Vert \textit{\textbf{v}}\Vert )=(3^\circ ,0.03)\), \(\gamma =1\), \(\sigma _g=1\). We plot \(\beta _1(k,y_1)\) against \(\frac{\gamma y_1}{h}\) for different values of k in Fig. 4(a) to give a better understanding of scanline pose interpolation. The reprojection error curves are plotted in Fig. 4(b). We observe that the performance of ‘GS-disc’ drops considerably as k deviates from 0, while ‘RS-ConstAcc’ maintains almost constant accuracy. Also notice that the curves are not symmetric, as \(k>0\) indicates acceleration (increasing velocity) while \(k<0\) indicates deceleration (decreasing velocity).

Fig. 3. Quantitative comparison under different configurations.

Fig. 4. Quantitative comparison under different values of k.

5.2 Real Data

We find that the RS videos used in prior works, e.g.  [11, 15, 52], often contain small jitters without large viewpoint change across consecutive frames. To demonstrate the power of our method, we collect 5 videos (around 2k frames in total) with hand-held RS cameras while running, leading to large camera shaking and RS distortion. Following  [35], we simply set \(\gamma =1\) to avoid its nontrivial calibration  [41] and find that it works well for our camera.

Two-View Experiments. Below we discuss the two-view experiment results.

Qualitative Evaluation. We first present a few qualitative examples to intuitively demonstrate the performance gap, in terms of RS image rectification and stitching. For RS image rectification, we compare our method with the differential SfM based approach  [66] (‘DiffSfM’) and the RS repair feature in Adobe After Effect (‘Adobe AE’). For RS image stitching, we compare with the single GS discrete homography stitching (‘GS’) and its spatially-varying extension  [63](‘APAP’). In addition, we also evaluate the sequential approaches which feed ‘DiffSfM’ (resp. ‘Adobe AE’) into ‘APAP’, denoted as ‘DiffSfM+APAP’ (resp. ‘Adobe AE+APAP’). We denote our single RS homography stitching without rectification as ‘RS’, our spatially-varying RS homography stitching without rectification as ‘RS-APAP’, and our spatially-varying RS homography stitching with rectification as ‘RS-APAP & Rectification’.

In general, we observe that although ‘DiffSfM’ performs very well for pixels with accurate optical flow estimates, it may cause artifacts elsewhere. Similarly, we find ‘Adobe AE’ to be quite robust on videos with small jitters, but it often introduces severe distortion in the presence of strong shaking. Due to the space limit, we show two example results here and leave more to the supplementary.

Fig. 5.
figure 5

Comparison of rectification/stitching on real RS images. Best viewed in screen.

In the example of Fig. 5, although ‘DiffSfM’ successfully rectifies the door and tube to be straight, the boundary parts (red circles) are highly skewed—these regions have no correspondences in frame 2 from which to compute flow. ‘Adobe AE’ manages to correct the images to some extent, yet brings evident distortion at the boundary too, as highlighted. ‘RS-APAP & Rectification’ nicely corrects the distortion, with the two images readily stitched together. Regarding image stitching, we overlay the two images after warping, with the discrepancy visualized in green/red, and show the linearly blended images beside them. As can be seen, ‘GS’ causes significant misalignments. ‘APAP’ reduces them to some extent, but not completely. The artifacts from ‘DiffSfM’ and ‘Adobe AE’ persist in the stitching stage. Even for non-boundary pixels, misalignments remain, since the rectification is done per frame in isolation, independently of the subsequent stitching. In contrast, we observe that even a single RS homography (‘RS’) suffices to warp the images accurately here, yielding a result similar to ‘RS-APAP’.

Fig. 6.
figure 6

Comparison of rectification/stitching on real RS images. Best viewed in screen.

We show one more example in Fig. 6 with partial results (the rest are in the supplementary). ‘DiffSfM’ removes most of the distortion, to the extent that ‘APAP’ warps the majority of the scene accurately (‘DiffSfM+APAP’); yet misalignments are still visible as highlighted, again due to its sequential nature. We would like to highlight that APAP plays a role here in removing the misalignment left by ‘RS’, leading to the best stitching result.

Fig. 7.
figure 7

Quantitative evaluation under standard setting & further study.

Quantitative Evaluation. Here, we conduct a quantitative evaluation to characterize the benefits brought about by our RS model. For every pair of consecutive frames, we run both ‘GS-disc’ and ‘RS-ConstAcc’, each with 1000 RANSAC trials. We compute for each pair the median reprojection error over all correspondences and plot its cumulative distribution function (CDF) across all frame pairs, as shown in Fig. 7(a). Clearly, ‘RS-ConstAcc’ yields higher-quality warping with reduced reprojection errors.
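The per-pair statistic behind the CDF in Fig. 7(a) can be sketched as follows; the helper names are ours, and we assume correspondences are stored as \(N\times 2\) arrays of pixel coordinates.

```python
import numpy as np

def median_reproj_error(H, pts1, pts2):
    """Median reprojection error of correspondences pts1 -> pts2
    under a 3x3 homography H (pts1, pts2: Nx2 arrays)."""
    ones = np.ones((pts1.shape[0], 1))
    proj = np.hstack([pts1, ones]) @ H.T       # homogeneous warp
    proj = proj[:, :2] / proj[:, 2:3]          # dehomogenize
    return np.median(np.linalg.norm(proj - pts2, axis=1))

def empirical_cdf(errors):
    """Sorted error values and cumulative fractions for a CDF plot."""
    e = np.sort(np.asarray(errors, dtype=float))
    return e, np.arange(1, len(e) + 1) / len(e)

# Sanity check: identity warping on perfectly matching points.
pts = np.random.rand(100, 2) * 640
print(median_reproj_error(np.eye(3), pts, pts))  # 0.0
```

Each frame pair contributes one median error; collecting these over the whole sequence and passing them to `empirical_cdf` reproduces the kind of curve shown in Fig. 7(a).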

Although the above comparison demonstrates promising results in favor of the RS model, we carry out further studies for two reasons. First, the more complicated RS model has a higher DoF, and the smaller reprojection errors might simply be due to over-fitting to the observed data rather than a truly higher fidelity of the underlying model. Second, different numbers of correspondences (4 vs. 5) are sampled in each RANSAC trial, leading to different total numbers of samples used by the two algorithms. To address these concerns, we conduct two further investigations. First, for each image pair, we reserve 500 pairs of correspondences as a test set and preclude them from being sampled during RANSAC; we then compare how well the estimated models perform on this set. Second, we test two strategies to equalize the total number of samples—‘GS-MoreTrials’, which increases the number of RANSAC trials for ‘GS-disc’ to \(1000\,{\times }\,5/4=1250\), and ‘GS-5point’, which samples a non-minimal set of 5 points in each trial and obtains a solution in a least-squares sense. As shown in Fig. 7(b), although ‘GS-5point’ does improve the warping slightly, all the GS-based methods still lag behind the RS model, further validating its utility.
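For ‘GS-5point’, fitting a homography to a non-minimal sample in a least-squares sense can be done with the standard DLT via SVD; the sketch below shows this textbook approach (not necessarily the exact implementation used in our experiments).

```python
import numpy as np

def homography_dlt(pts1, pts2):
    """Least-squares homography from N >= 4 correspondences via DLT.

    Each correspondence (x, y) -> (u, v) contributes two rows to A;
    the solution is the right singular vector of A associated with
    the smallest singular value.
    """
    rows = []
    for (x, y), (u, v) in zip(pts1, pts2):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Recover a known homography from a non-minimal sample of 5 points.
H_true = np.array([[1.1, 0.02, 5.0], [0.01, 0.95, -3.0], [1e-4, 2e-4, 1.0]])
pts1 = np.array([[0, 0], [100, 0], [0, 100], [100, 100], [50, 60]], float)
p = np.hstack([pts1, np.ones((5, 1))]) @ H_true.T
pts2 = p[:, :2] / p[:, 2:3]
print(np.allclose(homography_dlt(pts1, pts2), H_true, atol=1e-6))  # True
```

With exact data the over-determined \(10\times 9\) system still has a one-dimensional null space, so the minimal-singular-vector solution recovers the true homography; with noisy points it returns the algebraic least-squares fit.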

Comparison with Homographies for Video Stabilization  [11, 36]. Here, we compare with the mesh-based spatially-variant homographies  [36] and the homography mixture  [11] proposed for video stabilization. We would like to highlight that the fundamental limitation of  [11, 36] is that each individual homography is still GS-based, whereas ours explicitly models the RS effect. We follow  [28, 32] to evaluate image alignment by the RMSE of one minus the normalized cross correlation (NCC) over a \(3\,{\times }\,3\) window around each pair of overlapping pixels \(\textit{\textbf{x}}_i\) and \(\textit{\textbf{x}}_j\), i.e., \(RMSE = \sqrt{\frac{1}{N}\sum _\pi (1-NCC(\textit{\textbf{x}}_i,\textit{\textbf{x}}_j))^2}\), with N being the total number of pixels in the overlapping region \(\pi \). As shown in Table 1, RS-APAP achieves a lower average RMSE than  [11, 36]. Surprisingly,  [36] is not significantly better than GS, probably because its shape-preserving constraint becomes too strict for our strongly shaky videos. We also note that, in parallel with MDLT, our RS model could be integrated into  [11, 36] as well; this is, however, left as future work.
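For concreteness, the alignment metric can be sketched as follows, assuming already-warped grayscale images of equal size; the handling of the overlapping region \(\pi \) after warping is simplified here to all interior pixels, and the function name is ours.

```python
import numpy as np

def ncc_rmse(img1, img2, eps=1e-8):
    """RMSE of (1 - NCC) over 3x3 windows at every interior pixel.

    img1, img2: aligned grayscale float arrays of equal shape. In the
    paper the sum runs over the overlapping region after warping; here
    we simply use all interior pixels. eps guards zero-variance windows.
    """
    h, w = img1.shape
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            a = img1[y-1:y+2, x-1:x+2].ravel()
            b = img2[y-1:y+2, x-1:x+2].ravel()
            a = a - a.mean()
            b = b - b.mean()
            ncc = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
            vals.append((1.0 - ncc) ** 2)
    return np.sqrt(np.mean(vals))

# Perfectly aligned (identical) images have NCC = 1 everywhere,
# so the metric is essentially zero.
np.random.seed(0)
img = np.random.rand(16, 16)
print(ncc_rmse(img, img) < 1e-6)  # True
```

Lower values indicate better photometric alignment of the warped overlap, which is how the numbers in Table 1 are to be read.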

Table 1. RMSE evaluation for image stitching using different methods.
Fig. 8.
figure 8

Qualitative comparison to DiffSfM  [66], the method of Lao and Ait-Aider  [24], and the method of Vasu et al.  [61]. Stitched images with rectification are shown for [24] and ours.

Test on Data from [66]. We also compare with  [24, 61] on the 6 image pairs used in  [66], with 2 shown in Fig. 8 and 4 in the supplementary. We show the results from our single RS model without APAP for a fair comparison with  [24, 66]. First, we observe that our result is no worse than that of the full 3D reconstruction method  [66]. In addition, our method performs on par with  [24, 61], while being considerably more concise and simpler.

Multiple-View Experiments. We demonstrate an extension to multiple images by concatenating the pairwise warpings (note that the undetermined \(\varepsilon \)’s do not affect this step). We show an example in Fig. 9 and compare with the multi-image APAP  [64], AutoStitch  [6], and Photoshop. The AutoStitch result exhibits severe ghosting artifacts. APAP reduces them, but not completely. Photoshop applies advanced seam cutting for blending, yet cannot hide the misalignments. Despite its naive nature, our simple concatenation shows superior stitching results.
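The concatenation of pairwise warpings can be sketched as follows; the function names are ours, and blending/seam handling is omitted.

```python
import numpy as np

def chain_to_reference(pairwise_H, ref=0):
    """Compose pairwise homographies into warps to a reference frame.

    pairwise_H[i] maps frame i to frame i+1; the returned list T has
    T[k] mapping frame k into frame `ref`. This sketches only the
    concatenation step; blending and seam handling are separate.
    """
    n = len(pairwise_H) + 1
    T = [np.eye(3) for _ in range(n)]
    for k in range(ref + 1, n):          # frames after the reference
        T[k] = T[k - 1] @ np.linalg.inv(pairwise_H[k - 1])
    for k in range(ref - 1, -1, -1):     # frames before the reference
        T[k] = T[k + 1] @ pairwise_H[k]
    return [Ti / Ti[2, 2] for Ti in T]

# Example with two pure translations: frame 2 mapped into frame 0 is
# shifted back by the accumulated motion.
def trans(a, b):
    return np.array([[1.0, 0.0, a], [0.0, 1.0, b], [0.0, 0.0, 1.0]])

T = chain_to_reference([trans(10, 0), trans(0, 5)])
print(np.allclose(T[2], trans(-10, -5)))  # True
```

Each frame is then warped into the reference frame by its composed homography, after which any standard blending scheme can be applied.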

Fig. 9.
figure 9

Qualitative comparison on multiple image stitching.

6 Conclusion

We propose a new RS-aware differential homography, together with its spatially-varying extension to allow local deformation. At its core is a novel minimal solver strongly governed by the underlying RS geometry. We demonstrate its application to RS image stitching and rectification at one stroke, achieving good performance. We hope this work can shed light on handling the RS effect in other vision tasks, such as large-scale SfM/SLAM  [4, 43, 57, 67, 69].