1 Introduction

Mobile devices capable of capturing video under a wide range of circumstances have become popular in recent years. The handling of these devices is typically not controlled and may introduce unwanted oscillations that degrade video quality.

Video stabilization [2, 6, 13, 17, 19, 20, 21, 25, 27, 32] aims to compensate for undesired camera motion during video acquisition. Efficient methods for stabilizing videos are important to improve their quality according to human perception, as well as to facilitate other tasks such as indexing, search and content retrieval [11, 12, 16].

Techniques and criteria for the evaluation of video stabilization must be well established to advance the state of the art in the field, so that approaches can be improved and compared in an appropriate manner.

Quantitative techniques for the evaluation of video stabilization available in the literature are, in some cases, inconsistent with visual perception. In addition, the methods used to evaluate and report results subjectively have been little explored. As the main contribution of this work, we introduce and analyze the motion energy image (MEI) for the subjective and objective evaluation of video stabilization methods. One reason for applying such a method only to assess the final results is that incorporating it into the video stabilization pipeline itself would be very complex. On the other hand, this keeps the evaluation unbiased with respect to the technique used to stabilize the video.

The proposed method assumes that video stabilization quality can be estimated from the amount of motion present in the video. Both low-frequency and high-frequency motion are considered, since both may contain instabilities [15]. Object movement is also captured; however, this does not represent a problem, since the proposed method is used to compare results for the same video sequence (before and after the stabilization process), not to compare different video sequences. Experimental results show that our method is effective in evaluating stabilization, differentiating stable from unstable videos, and that its assessment is more coherent than that of the metrics used in the literature.

This paper is organized as follows. Relevant concepts and related work are briefly described in Sect. 2. The use of the motion energy image for subjective evaluation of video stabilization is presented in Sect. 3. Experimental results are presented and discussed in Sect. 4. Final remarks and directions for future work are described in Sect. 5.

2 Background

Video stabilization systems of different categories have been proposed in the literature, the most common being mechanical, optical and digital stabilization. Mechanical stabilization uses sensors to detect and compensate for camera shake. Optical stabilization employs a mechanism that compensates for angular and translational motion, stabilizing the image before it is recorded on the sensor. Digital stabilization is implemented in software, without the use of special devices.

In the context of image and video processing, the evaluation process can be classified as either objective or subjective. Evaluation is objective when measured by means of quantitative metrics applied between two images or videos, whereas evaluation is subjective when it is performed by human observers [14]. In both cases, it is typically desired to evaluate stabilization using criteria based on the perception of the human visual system.

2.1 Objective evaluation

Criteria for measuring the amount and nature of displacement have been proposed to evaluate the quality of video stabilization objectively [28]. Unintentional motion is decomposed into divergence and jitter through low-pass and high-pass filters, respectively. The amount of jitter in the stabilized video is compared with that of the original, and the divergence, which indicates the amount of expected displacement, is also verified. As an overall assessment, the blurring caused by the stabilization process is considered as well.
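To make the idea concrete, the sketch below separates an estimated motion trajectory into a low-frequency component and a high-frequency residual with a Gaussian low-pass filter. This is only an illustration of the general principle, not the exact formulation of [28]; the filter choice and the sigma value are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def decompose_motion(trajectory, sigma=5.0):
    """Split a 1D per-frame motion trajectory into a low-frequency component
    (divergence) and a high-frequency residual (jitter).
    The Gaussian filter and its sigma are illustrative choices, not those of [28]."""
    traj = np.asarray(trajectory, dtype=np.float64)
    divergence = gaussian_filter1d(traj, sigma)   # low-pass: expected displacement
    jitter = traj - divergence                    # residual: high-frequency shake
    return divergence, jitter
```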

Most of the approaches available in the literature adopt the Interframe Transformation Fidelity (ITF) metric [3, 8, 10, 29, 35], which can be expressed as the average peak signal-to-noise ratio (PSNR) computed over each pair of consecutive frames in the video. Some recent techniques consider the Structural Similarity (SSIM) index [36] as an alternative to PSNR [8].
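For reference, a minimal sketch of the ITF computation under that definition is given below; it assumes OpenCV for decoding and grayscale conversion, and treats a pair of identical frames as contributing an infinite PSNR, a boundary case the definition leaves open.

```python
import cv2
import numpy as np

def itf(video_path, l_max=255.0):
    """Interframe Transformation Fidelity: mean PSNR over consecutive frame pairs."""
    cap = cv2.VideoCapture(video_path)
    prev, psnr_values = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float64)
        if prev is not None:
            mse = np.mean((gray - prev) ** 2)
            psnr_values.append(np.inf if mse == 0 else 10.0 * np.log10(l_max ** 2 / mse))
        prev = gray
    cap.release()
    return float(np.mean(psnr_values))
```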

Liu et al. [23] employed the amount of energy present in the low-frequency portion of the estimated 2D motion as a stability metric. Rates of frame cropping and distortion were also considered to assess stabilization more generally.

The synthesis of unstable videos from stable ones was proposed as a way of evaluating the stabilization process with the stable videos serving as ground truth [30]. The methods developed in that work were evaluated according to two aspects: (i) the distance between the stabilized frame and the reference frame and (ii) the average SSIM between each pair of consecutive frames.

Fig. 1 Trajectory of horizontal translation

Fig. 2 Frame sequence of a video. a Original video, b–d different versions of the stabilized video. Extracted from [37]

Due to the weaknesses of the ITF metric for videos with camera motion, a measure based on the variation of the angles between the global motion vectors calculated from SIFT keypoints [24] was developed to evaluate the video stabilization process [7]. For fixed-camera videos, the ITF metric was still considered, however, computed only over the overlapping frame background instead of the entire frame.

2.2 Subjective evaluation

Many approaches briefly analyze the trajectories followed by the camera and the trajectories of the stabilized video [5, 9, 22, 26, 31]. These trajectories are usually related to the different components of the estimated 2D motion. Some methods present, for example, the path of the camera in terms of individual transformations, such as translations and rotations. Figure 1 shows an example of the horizontal translation path estimated from the original (green) and smoothed (blue) videos.

Fig. 3 Mean gray-level frames for the first ten frames. a Original video, b stabilized video

Fig. 4 Histograms of motion in the HSV color space. Extracted from [33]

Fig. 5 Main steps of the proposed methodology

From the trajectory, we can observe when a motion occurs in the original video and with which intensity, as well as how that motion behaves after smoothing. This type of visualization can be very useful for analyzing the behavior of the motion smoothing step employed by a given method. However, the result depends on the technique used for motion estimation, so the trajectory does not accurately represent the video motion. Thus, the trajectory may be neither a good alternative for evaluating stabilization quality nor an adequate visualization for videos with spatially distinct motion.

Other approaches show frame sequences, usually superimposed with horizontal and vertical lines [5, 8, 10, 22, 26, 34, 37], making it possible to check the alignment of a small set of consecutive frames. Figure 2 illustrates an example of this visualization, where objects intercepted by the lines should be better aligned in the stabilized video.

From the sequence of frames, it is possible to analyze the displacement of each frame, in addition to the number of pixels lost due to the transformation applied to each frame. However, this technique becomes impractical when a large number of frames is considered, making it unfeasible to analyze the entire video.

There are also approaches that summarize the video in a single image through the mean gray-level frame [18, 38], as shown in Fig. 3. Sharper images are expected for more stable videos. From this visualization, it is possible to check whether the video has motion; however, it is difficult to determine the nature of the motion present in the video.

In a broader context, the visualization of videos is concerned with the creation of a new visual representation, obtained from an input video, capable of indicating its characteristics and important events [4]. Video visualization techniques can generate different types of output data, such as another video, a collection of images, a single image, among others. Borgo et al. [4] reported a review of several video visualization techniques proposed over the last years.

Fig. 6 Pseudocolor transformation applied to the images of the average MEIs

In order to help users find scenes with specific motion characteristics in the context of video browsing, the visualization of motion histograms in the hue-saturation-value (HSV) color space [33] was proposed. The motion histograms are obtained from the motion vectors contained in the H.264/AVC stream. Figure 4 shows an example of such a visualization, where each frame of the video is represented by a vertical line, the motion direction is mapped into different colors, and the motion intensity into brightness values. The disadvantage of this technique is the presence of noise in the motion vectors introduced by the motion estimation algorithm [33].

Table 1 Video sequences from the first dataset
Table 2 Categories and amount of videos present in the second dataset, proposed by Liu et al. [23]
Fig. 7 Difference images for unstable video #4 with \(i=2\). a \(j = 3\), b \(j = 4\), c \(j = 5\), d \(j = 6\), e \(j = 7\)

Fig. 8 Difference images for video #4 after stabilization with \(i=2\). a \(j = 3\), b \(j = 4\), c \(j = 5\), d \(j = 6\), e \(j = 7\)

3 Average motion energy image for subjective evaluation

The motion energy image (MEI) is a binary image that represents the presence of video motion in a given region. This occurrence is determined by the difference in the gray levels of the video frames. White pixels denote the presence of motion, whereas black pixels denote its absence [1]. In conjunction with the motion history image (MHI), MEI is generally used in the context of human action recognition in videos [1].

Fig. 9 MEI for video #4 with \(i=2\). a Original video, b stabilized video

In this work, we consider the average of the motion energy images obtained throughout the video to assess the amount of motion and to characterize its stability. Figure 5 presents the main stages of our methodology.

For each video frame i, the difference in gray levels is computed at every pixel. This is done on frames preprocessed with a Gaussian filter, whose kernel was experimentally set with \(\sigma =5\); the smoothing is applied so that unnecessary details are disregarded when the difference is calculated. In this step, a binary image is obtained, where 1 is assigned to pixels whose difference is greater than a certain threshold, and 0 otherwise. This calculation can be seen as a sub-step of the MEI construction, expressed as

$$\begin{aligned} \text {Diff}_{i,j}(x,y) = \left\{ \begin{array}{ll} 1 &{}\quad \text{ if } \texttt {med}(|f_i(x,y) - f_j(x,y)|) \ge T \\ 0 &{}\quad \text{ otherwise } \end{array} \right. \end{aligned}$$
(1)

where \((x,y)\) denotes a given pixel and f is the already smoothed frame. In turn, i and j correspond to the i-th and j-th frame indices, respectively. T corresponds to the threshold, experimentally chosen as 10. Finally, med() is a median filter with a kernel of size 5, applied to reduce the discontinuities of the differences.
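A minimal sketch of this sub-step is shown below, using OpenCV and the parameter values stated above (\(\sigma = 5\), \(T = 10\), median kernel of size 5); the helper names are ours.

```python
import cv2
import numpy as np

def smooth(frame_gray, sigma=5):
    """Gaussian pre-filtering of a grayscale frame; the kernel size is derived from sigma."""
    return cv2.GaussianBlur(frame_gray, (0, 0), sigma)

def frame_difference(f_i, f_j, threshold=10, median_ksize=5):
    """Binary difference image Diff_{i,j} of Eq. (1) for two smoothed grayscale frames."""
    diff = cv2.medianBlur(cv2.absdiff(f_i, f_j), median_ksize)   # med(|f_i - f_j|)
    return (diff >= threshold).astype(np.uint8)                  # 1 where motion, 0 otherwise
```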

We compute an MEI for each frame i, obtained from the differences between the frames within a sliding window of size N centered at i. The MEI calculation can be expressed as

$$\begin{aligned} \text {MEI}_i = \frac{\displaystyle \sum \nolimits _{j \in \varOmega _i, j \ne i} G(|i-j|,\sigma ) \text {Diff}_{i,j}}{\displaystyle \sum \nolimits _{j \in \varOmega _i, j \ne i}G(|i-j|,\sigma )} \end{aligned}$$
(2)

where G() is a Gaussian function that assigns larger weights to the differences of the nearest frames. \(\varOmega _i\) is the neighborhood of i determined by the sliding window.

Fig. 10 Average image of the MEIs for video #4. a Original video, b stabilized video

Fig. 11 Histogram of the average image of the MEIs for video #4

Fig. 12 Average image of the colored MEIs for video #4. a Original video, b stabilized video

Fig. 13 Average grayscale image for video #4. a Original video, b stabilized video

In contrast to the MEI calculation typically performed in the literature, we consider the differences with respect to the central frame of the window. This is done so that motion occurring more gradually can also be captured by the MEI.

The window size N is based on the number of frames per second (FPS), in order to always consider the same time interval, expressed as

$$\begin{aligned} N = \frac{\text {FPS}}{n} \end{aligned}$$
(3)

where \(n=5\) is empirically adopted in our work.

The use of a Gaussian function to assign larger weights to the frames closer to the central frame is based on the premise that the oscillations present in unstable videos usually occur more abruptly than desired motion.
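Putting Eqs. (2) and (3) together, a sketch of the per-frame MEI computation is shown below. It reuses the smooth and frame_difference helpers sketched for Eq. (1); the width of the temporal Gaussian weights is not specified in the text, so the value used here is only illustrative.

```python
import numpy as np

def mei_for_frame(frames_gray, i, fps, n=5, weight_sigma=1.0):
    """MEI_i of Eq. (2): Gaussian-weighted average of Diff_{i,j} over a window of N = FPS / n frames."""
    N = max(int(round(fps / n)), 3)                     # Eq. (3): window size
    half = N // 2
    num = np.zeros(frames_gray[i].shape, dtype=np.float64)
    den = 0.0
    f_i = smooth(frames_gray[i])
    for j in range(max(0, i - half), min(len(frames_gray), i + half + 1)):
        if j == i:
            continue
        w = np.exp(-((i - j) ** 2) / (2.0 * weight_sigma ** 2))   # G(|i - j|, sigma)
        num += w * frame_difference(f_i, smooth(frames_gray[j]))
        den += w
    return num / den                                    # per-pixel values in [0, 1]
```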

By taking the MEI of each frame, the average image of the MEIs is computed, where each pixel \((x,y)\) is the arithmetic mean of the pixels \((x,y)\) over all MEIs of the video. From the resulting gray-level image, it is possible to verify the amount of motion present in the video, as well as its location and spatial distribution in the frames.

The human visual system can distinguish thousands of colors, but only a few tens of shades of gray. Thus, a pseudocolor transformation is applied, such that high gray-level intensities are mapped to red, whereas lower intensities are mapped to blue. Figure 6 shows the color mapping used. A more stable video is expected to contain less motion and, therefore, to yield a visualization with colors closer to blue than that of an unstable video.
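The averaging and pseudocolor steps can be sketched as follows; the exact color map of Fig. 6 is not specified here, so OpenCV's JET map is used as one option with a similar blue-to-red progression.

```python
import cv2
import numpy as np

def average_mei_visualization(meis):
    """Average the per-frame MEIs and map the result to pseudocolor
    (low intensities -> blue, high intensities -> red)."""
    avg = np.mean(np.stack(meis), axis=0)                 # gray-level average image of the MEIs
    gray = np.clip(avg * 255.0, 0, 255).astype(np.uint8)
    color = cv2.applyColorMap(gray, cv2.COLORMAP_JET)
    return gray, color
```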

Fig. 14 Average grayscale image for video #7. a Original video, b stabilized video

Fig. 15 Average image of the colored MEIs for video #7. a Original video, b stabilized video

In addition to the visualization, we extracted statistical measurements from the gray-level image in order to obtain an objective metric that characterizes the average amount of motion (AAM) present in the video and that can be used to determine the quality of the stabilization process. For this, we consider the normalized average of the gray-level intensities, which can be expressed as

$$\begin{aligned} \text {{AAM}} = \frac{\displaystyle \sum \nolimits _x^W \sum \nolimits _y^H I(x,y)}{W H L_{\max }} \end{aligned}$$
(4)

where I(x, y) is the gray-level intensity of the average MEI image at pixel \((x,y)\), W and H correspond to the width and height of the image, respectively, and \(L_{\max }\) is the maximum intensity that a pixel can assume.

The AAM value is normalized between 0 and 1, with higher values indicating a greater amount of motion. Typically, a more stable video should yield a lower AAM value than its unstable version. As with the visual representation, we use the AAM to compare videos before and after the stabilization process; therefore, we need not be concerned with the interference of moving objects, since they appear in both versions.
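Given the gray-level average image of the MEIs, the AAM of Eq. (4) reduces to a normalized mean; a minimal sketch, assuming an 8-bit image (\(L_{\max } = 255\)):

```python
import numpy as np

def aam(average_mei_gray, l_max=255.0):
    """Average amount of motion, Eq. (4): normalized mean gray level of the average MEI image."""
    h, w = average_mei_gray.shape
    return float(average_mei_gray.astype(np.float64).sum() / (w * h * l_max))
```

A lower AAM for the stabilized version of a given video then indicates a more successful stabilization.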

4 Results

This section presents the results obtained from our experiments. Sections 4.1 and 4.2 describe the results obtained with the subjective visualization and the objective metric, respectively.

Fig. 16 Average grayscale image for video \(\texttt {Crowd}_{\texttt {0}}\). a Original video, b stabilized video

Fig. 17 Average image of the colored MEIs for video \(\texttt {Crowd}_{\texttt {2}}\). a Original video, b stabilized video

Two datasets were used in our experiments. The first one is composed of fourteen videos, eleven of which were extracted from the GaTech VideoStab [15] and the remaining three collected separately. Table 1 reports a summary of the first dataset, with the videos listed in alphabetical order; we will refer to these videos by the identifiers assigned to each of them. Table 2 presents the second dataset, proposed by Liu et al. [23], which is composed of 139 videos divided into six categories; we will refer to these videos by the name of the category followed by the identifier assigned to each video by the authors.

4.1 Visual representation

Figure 7 presents the difference images obtained for several indices within the sliding window for the originally unstable video. Figure 8 displays the corresponding images for the video obtained after the stabilization process with the YouTube approach [15].

From Figs. 7 and 8, we can observe more white pixels in the difference images of the unstable video, indicating a greater amount of motion. This becomes even more visible as the frame distance increases. These results confirm that the difference between frames at a certain distance can capture motion that is not perceived when only adjacent frames are compared.

Figure 9 shows the MEI for the same frame obtained for both the unstable and the stabilized videos. It can be verified that the MEI summarizes the difference images well and that the stabilized version has darker pixels than the unstable one, indicating the presence of less motion.

Figure 10 displays the gray-level average image of the MEIs for the unstable and stabilized versions of video #4. From the figure, it is possible to observe darker gray levels and more defined shapes in the image corresponding to the stabilized video. Similar results were observed for all videos in the dataset under consideration. Figure 11 presents the histograms of the images shown in Fig. 10, from which the stabilized and non-stabilized videos can be easily distinguished.

In the following results, we present the images obtained with the proposed method and compare them with the average grayscale of the video frames, as shown in Fig. 3 and described in Sect. 2.

Figure 12 shows the color image of the average of the MEIs for the unstable and stabilized versions of video #4. A greater visual distinction can be observed when compared to the gray-level image. For the unstable video, the image contains red regions, indicating the occurrence of a large amount of motion throughout the video, whereas the image is predominantly blue and green for the stabilized video. Figure 13 shows the result obtained with the average grayscale for the same video. The stabilized version appears better defined, whereas the image of the unstable video is more blurred; however, it is difficult to infer from this image how much motion is present in the video.

The drawback of the average grayscale image becomes even clearer when comparing the results obtained for video #7. Figures 14 and 15 show the results of the average grayscale and of the average of the MEIs for video #7. From the gray-level image, it is not easy to differentiate the unstable video from the stabilized one; in fact, the stabilized video seems to have more motion. The average MEI image of the stabilized video, on the other hand, presents bluer tones, correctly indicating a smaller amount of motion.

The visual representation proposed in this work is effective in showing the amount of motion present in a video, making possible the evaluation and comparison of different stabilization methods. Our technique is more effective than the simple average of the gray levels of the video frames, which can produce misleading results in the presence of intentional camera motion and small changes in the scene.

Figures 16 and 17 show the results obtained with the average grayscale image and the proposed visual representation for a video of a crowded scene.

From Figs. 16 and 17, we can see the differences between the two versions of the video in the proposed visual representation, before and after stabilization. Even after stabilization, some red remains in the result, probably due to the presence of moving people in the scene. However, stronger tones of red appear in the unstable version of the video, characterizing a video with a large amount of motion. The average grayscale images, in contrast, show little difference, demonstrating the superiority of our visual representation.

Figures 18 and 19 show the results obtained for the average grayscale image and the proposed visual representation in a video that contains a parallax effect.

Fig. 18 Average grayscale image for video \(\texttt {Parallax}_\texttt {0}\). a Original video, b stabilized video

Fig. 19 Average image of the colored MEIs for video \(\texttt {Parallax}_\texttt {0}\). a Original video, b stabilized video

From Figs. 18 and 19, we can observe that redder tones were obtained in the unstable video version, whereas the image of the average grayscale presents little distinction between the two versions of the video.

Figures 20 and 21 illustrate the results obtained for the average grayscale image and the proposed visual representation in a video with fast translations.

Fig. 20 Average grayscale image for video \(\texttt {QuickRotation}_\texttt {0}\). a Original video, b stabilized video

Fig. 21 Average image of the colored MEIs for video \(\texttt {QuickRotation}_\texttt {0}\). a Original video, b stabilized video

From Figs. 20 and 21, we can notice that a video with fast translations tends to produce very red tones. As in the other cases, lighter tones are obtained in the stabilized version. After stabilization, the visual representation still shows red tones, since a certain amount of desired motion remains in the video. Again, the visualization based on the average grayscale image is not very effective.

Figures 22 and 23 present the results obtained for the average grayscale image and the proposed visual representation in a video with a regular scene.

Fig. 22 Average grayscale image for video \(\texttt {Regular}_\texttt {0}\). a Original video, b stabilized video

Fig. 23 Average image of the colored MEIs for video \(\texttt {Regular}_\texttt {0}\). a Original video, b stabilized video

From Figs. 22 and 23, the image for the stabilized version has considerably lighter colors, since this scene has little movement. We can also notice that redder tones are present in the region where a person is moving.

Figures 24 and 25 show the results obtained for the average grayscale image and the proposed visual representation for a video recorded while the person shooting it was running.

Fig. 24 Average grayscale image for video \(\texttt {Running}_\texttt {0}\). a Original video, b stabilized video

Fig. 25 Average image of the colored MEIs for video \(\texttt {Running}_\texttt {0}\). a Original video, b stabilized video

From Figs. 24 and 25, we can observe that the image tones are very reddish in both versions. This occurs due to the substantial change in the scene and to the motion caused by the person shooting the video. Nevertheless, lighter tones can be noticed in the stabilized version.

Figures 26 and 27 present the results obtained for the average grayscale image and the proposed visual representation in a video in the presence of zoom.

Fig. 26 Average grayscale image for video \(\texttt {Zooming}_\texttt {0}\). a Original video, b stabilized video

Fig. 27 Average image of the colored MEIs for video \(\texttt {Zooming}_\texttt {0}\). a Original video, b stabilized video

From Figs. 26 and 27, it is possible to observe that the stabilized version has lighter tones, which demonstrates the advantages of our method.

4.2 Objective metric

Table 3 displays the AAM and ITF values for the original videos and for their versions stabilized with the YouTube method [15].

Table 3 AAM and ITF values for the videos from the first dataset

Table 4 shows the AAM and ITF values for the tested videos before and after the stabilization process; both versions are available in the dataset. The stabilized versions were originally obtained with the method proposed by Liu et al. [23].

Table 4 Mean AAM and ITF values for the videos from the second dataset

From Tables 3 and 4, it can be noticed that the proposed metric is consistent with the ITF metric on the evaluated videos, which demonstrates that it can be used as an alternative to the ITF. From Table 4, we can also see that the mean AAM value is smaller for the Regular, Zooming and Parallax categories, whose videos contain a lower amount of movement.

Table 5 presents the values of AAM and ITF metrics for video #4 stabilized through a simple method, where a Gaussian smoothing filter is applied with different values of \(\sigma \).

Table 5 AAM and ITF values for video #4

From Table 5, it can be seen that the ITF and AAM values decrease as \(\sigma \) increases. This occurs because the method treats the motion as undesired and removes most of it as \(\sigma \) grows. It is possible to observe that, with \(\sigma = 890\), the ITF obtained with the Gaussian filter is superior to that obtained with the YouTube method. However, the video generated with the Gaussian filter is visually more unstable, containing several distortions.

The AAM values reported in Table 5, on the other hand, are not smaller than the value obtained with the YouTube method and are, therefore, more consistent with the visual result of the video.

5 Conclusions and future work

This work presented a novel visual representation technique based on the motion energy image (MEI) for the subjective evaluation of video stabilization. The representation is constructed from the mean of the MEIs computed for all video frames and then highlighted with a pseudocolor transformation. In addition, the average gray level of the representation, denoted average amount of motion (AAM), was proposed as a new objective metric.

We were able to characterize the amount of spatial motion present in the video, as well as its location. Assuming that an unstable video has a greater amount of motion than its stabilized version, we can employ this technique to evaluate the video stabilization process.

The results showed that the proposed visual representation is adequate and expresses well both the amount and location of spatial motion. We compared our representation to the mean gray-level frames in several different scenarios and verified that the representation performed better.

The proposed objective metric presented consistent results. In some cases, the AAM outperformed the Interframe Transformation Fidelity (ITF), which is the most commonly used objective metric for evaluating video stabilization methods.

As directions for future work, we intend to conduct experiments with human participants to validate the proposed visual representation. We also plan to investigate the direction and speed of intensity changes in the video frames through visual rhythms for the subjective evaluation of video stabilization. Finally, we intend to investigate other objective metrics obtained from the visual representation technique proposed in this work, as well as its use in conjunction with visual rhythms for the characterization and evaluation of video stabilization.