1 Introduction

With the proliferation of handheld digital devices, such as digital cameras, digital camcorders, and mobile phones, it has become commonplace to capture memorable moments of everyday life with little effort. In addition, the captured photos and videos can be easily shared via social network services such as Facebook or YouTube. To help users access these tremendous amounts of stored visual data, thumbnail images are provided as previews of the corresponding contents for effective browsing and searching. According to the study in [5], the displayed thumbnails strongly influence users' behavior in searching and browsing. Figure 1 shows examples in which different thumbnails may lead to completely different understandings of the content. Figure 1a shows the first frames of several video clips used as thumbnails, which is still the default in most digital devices. However, the first frame may be blurry due to camera motion and often fails to represent the gist of the video clip. In contrast, the thumbnails shown in Fig. 1b provide a better understanding of the video at a glance. Therefore, selecting a semantically representative frame is essential in video thumbnailing.

Fig. 1

Comparisons of thumbnails and their effect on understanding the video content. a first frame selected as a thumbnail, b thumbnail extraction results using our method

The primary goal of video thumbnailing is to extract the most representative frame, which abstracts the content of the video clip. Existing video thumbnailing methods are classified into two main types: static and dynamic [10]. Static thumbnailing techniques extract a single frame from a video clip to describe the content of the sequence. Conventional methods usually focus on the quality of individual frames for keyframe extraction, such as the level of blur, contrast, and motion [9, 10, 24]. Recently, a more advanced approach has been developed to extract a semantically meaningful keyframe that reflects the theme of the video [8]. However, it requires additional information, such as the video title or tags, to obtain representative visual samples from a database, and empirically determined thresholds are necessary for candidate keyframe selection. Similarly, video summarization extracts several keyframes to describe the whole content of the video and presents them in the form of a storyboard for effective browsing and searching. Video summarization is therefore highly related to video thumbnailing, as both share keyframe extraction techniques. Since static thumbnail extraction is limited in describing object movement in a video clip, dynamic thumbnailing techniques can be used to generate a video thumbnail by extracting a few seconds of consecutive frames [3, 16]. However, their relatively high computational complexity may limit their use in practice [6, 21].

In this paper, we propose an automatic static and dynamic video thumbnail extraction method that incorporates mid-level information, such as the location and size of semantic objects, as well as low-level information related to scene quality. We formulate corresponding energy terms from this information. In addition, when calculating the final energy, we give preference to frames whose layouts are similar to those of other frames. Finally, the proposed method automatically extracts the representative frame with the minimum energy cost among all frames.

The contribution of this paper is summarized as follows: (a) We propose an algorithm for static and dynamic video thumbnail extraction. To this end, we formulate energy terms that assess mid-level characteristics as well as scene quality. (b) We assume that frames whose layouts are similar to those of other frames are relevant in describing the video. We calculate the proposed scene binary pattern (SBP) descriptor for each frame and compute the probability of each SBP value by counting its frequency in the clip. This probability is used to give preference to such frames in thumbnail extraction.

The remainder of this paper is organized as follows: After Section 2 reviews related work, Section 3 addresses the cues used to construct the energy terms. Section 4 presents the proposed static and dynamic thumbnailing methods. Section 5 discusses experimental results, followed by the conclusion in Section 6.

2 Related work

2.1 Video summarization

The basic framework of video summarization can be briefly described as follows: First, the video sequence is divided into multiple shots by applying shot boundary detection or scene change detection algorithms. For each shot, a single representative frame is extracted as the keyframe. Then, the keyframes are presented in temporal order to build a storyboard.

Earlier work on video summarization concentrated on keyframe extraction using low-level visual features such as color, texture, shape, and motion [15, 17]. However, in these bottom-up approaches, keyframes are selected without regard to their semantic content. Recently, more advanced approaches have been developed that select keyframes using semantic analysis [1, 15, 16, 23]. Almeida et al. [1] design a video summarization system for online applications, which exploits HSV color histograms built directly in the compressed domain; the system allows user interaction to control the quality of the summaries. Wang et al. [23] present an event-driven web video summarization approach based on tag localization and key-shot mining. Ma et al. [15, 16] extract keyframes using a visual attention model for semantic analysis, where the attention model is based on saliency, faces, and camera motion. Ngo et al. [18] propose a unified approach for video summarization based on the analysis of video structure and video highlights. Yong et al. [26] present a keyframe extraction method that models the semantic context extracted from video frames; to represent the semantic context, low-level features are extracted blockwise from image segments.

2.2 Video thumbnailing

Unlike video summarization, which usually displays several keyframes in the form of a storyboard on a large screen, video thumbnailing aims at displaying a single keyframe or a short video because of limited display space and memory constraints. Lee et al. [12] extract a thumbnail directly from H.264/AVC bitstreams in the frequency domain while considering error propagation. Jiang and Zhang [10, 11] present a spatiotemporal vector quantization method to generate a video thumbnail, where the video time density function and an ICA-based feature extraction method are employed to explore the temporal and spatial characteristics of video frames, respectively. Another system automatically selects a frame containing flash illumination as the thumbnail [20], based on the observation that flash illumination occurs at interesting instants during recording; however, flash illumination is not common in personal video recordings. Gao et al. [8] present a video thumbnailing algorithm that reflects the theme of the video content, noting that a quality-based thumbnail may not be semantically representative. First, candidate keyframes are extracted using visual features such as color, motion, faces, and image quality. Then, to build a visual theme model, sample images are obtained by searching a visual database using the video tags. The candidate keyframes are compared to the theme model for semantic ranking, and the highest-ranked keyframe is selected as the video thumbnail. Several studies [7, 14, 28] report that there exists an intention gap between the author-generated video thumbnail and the user's query. To reflect the intention of the user in the thumbnail, Liu et al. [14] propose a query-sensitive web video thumbnail generation method, which not only considers visual content but also meets the preference of the user. Another approach for web video thumbnails that meets the user's preference is presented in [28]: the system recommends thumbnails that satisfy both video owners and browsers on the basis of image quality assessment, image accessibility analysis, video content representativeness analysis, and query-sensitive matching. Craggs et al. [7] present ThumbReels for query-sensitive web video previews; to create a preview that matches a user's query, viewers temporally tag videos in a crowd-sourced manner while watching them. Al-Hajri et al. [2] provide variable-sized thumbnails that represent popular content using viewing statistics derived from personal or crowd-sourced histories of video consumption for fast navigation of the video. Note that these web video thumbnailing methods [7, 14, 28] require the user's query and thus produce query-sensitive thumbnail results.

3 Cues to construct energy terms

We focus on personal home video for thumbnailing, since it captures real-life events and the use of unedited videos recorded by consumers is increasing dramatically. We observe that such videos typically consist of a single shot, because it is hard to edit while recording on mobile devices; therefore, it is not necessary to employ scene change detection or shot boundary detection algorithms. In the following subsections, we present seven visual cues for extracting a representative frame that satisfies both frame quality and the semantic level of content description. We then formulate an energy function based on these visual features in each frame for thumbnail extraction.

3.1 Face location (FL) and size (FS)

Face information acts as a primary visual cue, especially in the selection of the representative frame of a video clip. Specifically, we obtain the location and size of faces in each frame using the Viola-Jones face detector [22]. We consider a frame more meaningful if a face is located relatively close to the center of the image, and we regard a larger face as more important than a smaller one in the scene. Thus, we define two energy terms to describe the face information as follows:

$$ E_{FL}(i)=\frac{\sqrt{\left( x_{i,c} - W/2 \right)^{2}+\left( y_{i,c}- H/2 \right)^{2}}}{\sqrt{\left( W/2 \right)^{2}+\left( H/2 \right)^{2}}}, $$
(1)
$$ E_{FS}(i)=1-\left( \frac{{\sum}_{m=1}^{M_{i}}W_{i,face}^{m}\times H_{i,face}^{m}}{W\times H} \right)^{2} , $$
(2)

where \(M_{i}\) is the number of detected faces in the \(i\)th frame, and \((x_{i,c}, y_{i,c})\) represents the center of the largest face among all detected faces in the \(i\)th frame. \(W_{i,face}^{m}\) and \(H_{i,face}^{m}\) represent the width and height of the \(m\)th face in the \(i\)th frame, and \(W\), \(H\) denote the width and height of the video. Note that, as the face in the \(i\)th frame is located close to the center of the image, \(E_{FL}(i)\) gets close to 0. Also, when the face covers most of the scene in the \(i\)th frame, \(E_{FS}(i)\) gets close to 0. Note that if no face is detected in the frame, both \(E_{FL}(i)\) and \(E_{FS}(i)\) are set to 1.
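As an illustration, a minimal Python sketch of the face energy terms in (1) and (2) is given below. It assumes OpenCV's bundled Haar cascade as the Viola-Jones detector [22]; the function name face_energies and the detector parameters are illustrative choices, not part of the original method.

```python
import cv2
import numpy as np

_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_energies(frame_bgr):
    """Return (E_FL, E_FS) for one frame; both are 1 if no face is found."""
    H, W = frame_bgr.shape[:2]
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = _cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return 1.0, 1.0

    # E_FL, Eq. (1): distance of the largest face centre from the image centre.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    xc, yc = x + w / 2.0, y + h / 2.0
    e_fl = np.hypot(xc - W / 2.0, yc - H / 2.0) / np.hypot(W / 2.0, H / 2.0)

    # E_FS, Eq. (2): one minus the squared ratio of total face area to frame area.
    face_area = sum(w_m * h_m for (_, _, w_m, h_m) in faces)
    e_fs = 1.0 - (face_area / float(W * H)) ** 2
    return e_fl, e_fs
```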

3.2 Object location (OL) and size (OS)

In order to deal with other object classes in the image, we adopt the objectness measure proposed in our prior work [4]. In [4], the local regions in the image are categorized into one of three classes: natural, man-made, and object. We obtain the information of object location and size from the classified object region in each frame. Figure 2 shows the classification results of the object region in the test images. Similar to the face feature mentioned above, we compute the energy terms of the object location and size as follows:

$$ E_{OL}(i)=\frac{\sqrt{\left( x_{i,c} - W/2 \right)^{2}+\left( y_{i,c}- H/2 \right)^{2}}}{\sqrt{\left( W/2 \right)^{2}+\left( H/2 \right)^{2}}}, $$
(3)
$$ E_{OS}(i)=1-\left( \frac{area_{i,obj}}{W\times H} \right)^{2} , $$
(4)

where \(x_{i,c}\) and \(y_{i,c}\) represent the center of the largest object region among all detected object regions in the \(i\)th frame, and \(area_{i,obj}\) denotes the area of all detected object regions in the \(i\)th frame.
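A corresponding sketch of the object energy terms in (3) and (4) follows. The objectness measure of [4] is not reproduced here; instead, the sketch assumes a binary object mask (as illustrated in Fig. 2) is already available, and it takes the largest connected component as the largest detected object region.

```python
import numpy as np
from scipy import ndimage

def object_energies(object_mask):
    """Return (E_OL, E_OS) from a binary H-by-W object mask (1 = object)."""
    mask = np.asarray(object_mask, dtype=float)
    H, W = mask.shape
    if mask.sum() == 0:
        return 1.0, 1.0

    # E_OL, Eq. (3): centre of the largest connected object region.
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    yc, xc = ndimage.center_of_mass(mask, labels, int(np.argmax(sizes)) + 1)
    e_ol = np.hypot(xc - W / 2.0, yc - H / 2.0) / np.hypot(W / 2.0, H / 2.0)

    # E_OS, Eq. (4): one minus the squared ratio of total object area to frame area.
    e_os = 1.0 - (mask.sum() / float(W * H)) ** 2
    return e_ol, e_os
```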

Fig. 2

Examples of detected object region using [4]. Note that the object regions are overlaid in green

3.3 Frame difference (FD)

We observe that object movement is a valuable cue for determining the quality and representativeness of a frame. For instance, a fast-moving object is less preferred than a focused object with limited movement. While numerous motion estimation algorithms are available in the literature, estimating motion vectors for every frame requires a high computational cost. Thus, instead of calculating motion vectors over the whole sequence, we simply compute the frame difference between the current and previous frames and normalize it to define \(E_{FD}(i)\):

$$ E_{FD}(i)=\frac{{\sum}_{x,y}\left|I(x,y,i)-I(x,y,i-1) \right|}{W\times H}, $$
(5)

where \(I(x,y,i)\) represents the normalized pixel intensity at \((x,y)\) in the \(i\)th frame. Therefore, if an object is steadily focused, \(E_{FD}(i)\) gets close to 0.
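A minimal sketch of (5), assuming grayscale frames with intensities already normalized to [0, 1]:

```python
import numpy as np

def frame_difference_energy(curr, prev):
    """E_FD(i): mean absolute difference between consecutive frames."""
    H, W = curr.shape
    diff = np.abs(curr.astype(np.float64) - prev.astype(np.float64))
    return diff.sum() / (W * H)
```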

3.4 Focus blurness (FB)

Since blurred images are not desirable for keyframe selection, we attempt to reject them by assigning a higher energy. In our energy term, we compute focus blurness as described in [13] to measure the blurness of the image. Based on the assumption that a blurred version of an original image loses high-frequency components, an inherently blurry image is expected to differ little from a further blurred version of itself. Therefore, we define the energy term \(E_{FB}(i)\) as follows:

$$ E_{FB}(i)=1-\frac{{\sum}_{x,y}\left|I(x,y,i)-g(x,y)\ast I(x,y,i) \right|}{W\times H}, $$
(6)

where \(g(x,y)\) denotes the Gaussian function and "∗" denotes the convolution operation. Therefore, \(E_{FB}(i)\) gets close to 1 for blurry images, while it gets close to 0 for well-focused images.
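A sketch of (6) follows: the frame is re-blurred with a Gaussian kernel and the energy measures how little it changes. The 5×5 kernel size is an assumed choice; this paper follows [13] without fixing the kernel here.

```python
import cv2
import numpy as np

def focus_blurness_energy(gray01):
    """E_FB(i) for a grayscale frame with intensities in [0, 1]."""
    H, W = gray01.shape
    # Re-blur the frame; a well-focused frame loses more detail than a blurry one.
    reblurred = cv2.GaussianBlur(gray01.astype(np.float32), (5, 5), 0)
    return 1.0 - np.abs(gray01 - reblurred).sum() / (W * H)
```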

3.5 Scene steadiness (SS)

In selecting the representative frame of a video clip, it is important to infer the photographer's intention. Without the help of additional user interaction, we pay attention to repeated frames or relatively steady scenes to analyze the representativeness of the video sequence. Assuming that steady scenes contain more meaningful moments and are highly likely to share similar layouts within the video, we measure the frequency of similar layouts. Inspired by the LBP feature in [19], we propose the scene binary pattern (SBP) for indexing each frame.

To this end, we first divide each frame into a 3×3 grid of blocks. In each block, we calculate the average gray level, as illustrated in Fig. 3b. Then, we assign a binary value to each block by thresholding the neighboring blocks \(g_{i,n}\) (\(n=0,\ldots,N-1\)) against the center block \(g_{i,c}\) of the \(i\)th frame. Therefore, as depicted in Fig. 3c, we calculate the SBP of the \(i\)th frame as follows:

$$ SBP(i)=\sum\limits_{n=0}^{N-1}s\left( g_{i,n}-g_{i,c} \right)2^{n}, $$
(7)

where

$$ s(x)=\left\{\begin{array}{ll} 1, & x\geq0\\ 0, & x<0 \end{array}\right.. $$
(8)
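A sketch of the SBP computation in (7) and (8) is given below; the bit ordering of the eight neighboring blocks (clockwise from the top-left block) is an assumed convention, since only the thresholding rule is fixed above.

```python
import numpy as np

def scene_binary_pattern(gray):
    """Return SBP(i) in [0, 255] for a grayscale frame, Eqs. (7)-(8)."""
    H, W = gray.shape
    means = np.empty((3, 3))
    for r in range(3):
        for c in range(3):
            block = gray[r * H // 3:(r + 1) * H // 3,
                         c * W // 3:(c + 1) * W // 3]
            means[r, c] = block.mean()
    centre = means[1, 1]
    # Outer block means, clockwise from the top-left block.
    neighbours = [means[0, 0], means[0, 1], means[0, 2], means[1, 2],
                  means[2, 2], means[2, 1], means[2, 0], means[1, 0]]
    # s(g_n - g_c) contributes bit 2^n when the neighbour mean >= centre mean.
    return sum((1 << n) for n, g in enumerate(neighbours) if g >= centre)
```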
Fig. 3

An example of SBP. a original image, b mean of each of the 3×3 blocks, c thresholding of (b). The SBP of the sample scene is 47 (00101111\(_2\))

The histogram of the scene steadiness with SBP levels in the range \([0, L-1]\) is a discrete function \(h(r_{l})=t_{l}\), where \(r_{l}\) is the \(l\)th SBP level and \(t_{l}\) is the number of frames in the sequence having SBP level \(r_{l}\). Thus, the probability of the occurrence of an SBP level is given by \(p(r_{l})=t_{l}/T\), for \(l=0,1,\ldots,L-1\), where \(T\) is the total number of frames in the video clip. Figure 4 shows an example of the SBP values and their probabilities of occurrence for a test video sequence. Given the probability of the scene, we assign high weights to scenes with low frequency, while low weights are assigned to steady scenes. We obtain the weight of scene steadiness at the \(i\)th frame as follows:

$$ W_{SS}(i)=-\frac{\exp(p\left( SBP\left( i \right) \right))-1}{\exp(1)-1}+1, $$
(9)

where the weight is estimated by inverse modeling. Note that \(W_{SS}(i)\) ranges from 0 to 1, as illustrated in Fig. 5.
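A sketch of (9) over a whole clip, assuming the per-frame SBP codes have already been computed (e.g., by the SBP sketch above); the function name is illustrative.

```python
import numpy as np

def steadiness_weights(sbp_per_frame, num_levels=256):
    """Return W_SS(i) for every frame from its SBP code, Eq. (9)."""
    sbp = np.asarray(sbp_per_frame)
    T = len(sbp)
    hist = np.bincount(sbp, minlength=num_levels)   # t_l for each SBP level
    p = hist[sbp] / float(T)                        # p(SBP(i)) for each frame i
    return -(np.exp(p) - 1.0) / (np.e - 1.0) + 1.0  # W_SS(i)
```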

Fig. 4

An example of SBP values and their frequency according to the scene. a sub-sampled frames of the test video, b SBP values of each frame, c the probability of the occurrence of each SBP value

Fig. 5

A plot of \(W_{SS}\) according to the probability \(p(SBP)\)

4 Video thumbnail extraction

4.1 Static thumbnail extraction

We now formulate an energy function based on the energy terms obtained from the visual cues described in the previous section (see Figs. 6 and 7, b to g). We express the total energy of the \(i\)th frame as a weighted sum of the component energy terms:

$$\begin{array}{@{}rcl@{}} E_{total}(i)&=&\lambda_{1} E_{FL}(i) + \lambda_{2} E_{FS}(i)+ \lambda_{3} E_{OL}(i)\\ &&+ \lambda_{4} E_{OS}(i)+ \lambda_{5} E_{FD}(i) +\lambda_{6} E_{FB}(i), \end{array} $$
(10)

where

$$ \sum\limits_{j=1}^{6}\lambda_{j}=1. $$
(11)

Here, \(\lambda_{j}\) is the weight of each energy term, and the weight parameters are empirically tuned to obtain satisfying results. We set \(\lambda=[0.15, 0.15, 0.15, 0.15, 0.2, 0.2]\) in our experiments.

Fig. 6

An example of test video and corresponding energy terms on every frame. a sub-sampled frames, b face location distance, c face size ratio, d object location distance, e object size ratio, f frame difference, g focus blurness, h scene steadiness, i total energy, j final energy by (12)

Fig. 7

An example of test video and corresponding energy terms on every frame. a sub-sampled frames, b face location distance, c face size ratio, d object location distance, e object size ratio, f frame difference, g focus blurness, h scene steadiness, i total energy, j final energy by (12)

In order to give preference to steady scenes in the thumbnail extraction, we apply the scene steadiness weight \(W_{SS}\) to the total energy function as follows:

$$ E_{final}(i) = W_{SS}(i)\cdot E_{total}(i) $$
(12)

Finally, the proposed method automatically extracts the single frame with the minimum energy, \(\arg\min_{i} E_{final}(i)\). Figures 6 and 7 illustrate the effect of scene steadiness on the final energy.
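The static selection step can be sketched end to end as follows, assuming the per-frame energy terms and \(W_{SS}\) values have been stacked into arrays of length \(T\); the function name and array layout are illustrative.

```python
import numpy as np

def select_static_thumbnail(E_FL, E_FS, E_OL, E_OS, E_FD, E_FB, W_SS):
    """Return (index of selected frame, E_final), Eqs. (10)-(12)."""
    lambdas = np.array([0.15, 0.15, 0.15, 0.15, 0.2, 0.2])  # sum to 1, Eq. (11)
    terms = np.stack([E_FL, E_FS, E_OL, E_OS, E_FD, E_FB])  # shape (6, T)
    E_total = lambdas @ terms                                # Eq. (10)
    E_final = np.asarray(W_SS) * E_total                     # Eq. (12)
    return int(np.argmin(E_final)), E_final
```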

4.2 Dynamic thumbnail extraction

While static thumbnailing extracts a single representative frame with the minimum energy, dynamic approaches seek consecutive frames that represent the video clip. In our case, we extract the consecutive frames that minimize the following objective:

$$ \arg\min\limits_{i} \sum\limits_{k=i}^{i+dur-1}E_{final}(k), 0 < i \leq T-dur, $$
(13)

where \(i\) denotes the starting frame index, \(T\) is the total number of frames in the video clip, and \(dur\), which is set by the user, denotes the number of consecutive frames.
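A sketch of (13) as a sliding-window minimization over \(E_{final}\); note that the indices here are 0-based, whereas (13) uses 1-based frame indices.

```python
import numpy as np

def select_dynamic_thumbnail(E_final, dur):
    """Return (start, end) of the dur-frame window with minimum summed energy."""
    E_final = np.asarray(E_final)
    window_sums = np.convolve(E_final, np.ones(dur), mode="valid")  # sums over each window
    start = int(np.argmin(window_sums))
    return start, start + dur  # frames [start, start + dur) form the dynamic thumbnail
```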

5 Experiments

5.1 Dataset and details of experiments

We collected a total of 13 videos from YouTube [25] and a personal collection, with resolutions ranging from 640×480 to 1920×1080. The collected videos include both indoor and outdoor scenes typical of those captured by mobile users. Table 1 lists the test videos and their total frame lengths. Although many attempts have been made, there are no standard criteria for evaluating the performance of a thumbnailing algorithm. Therefore, we performed an extensive subjective evaluation of the proposed method on the collected videos.

Table 1 Summary of test videos

We performed a user study on subjective preference, similar to those used in [14, 27, 28], with 20 participants in total. In this study, users were shown the original video and asked to compare our result with the thumbnail taken from the first frame of the video clip, which is usually adopted in mobile applications. They were then asked to give a score of better, the same, or worse, indicating whether the thumbnails obtained by our algorithm are better than, the same as, or worse than the default thumbnails.

We also performed a subjective evaluation as described in [27]. During the evaluation, the subjects watched the original sequence in advance before evaluating each thumbnail. The participants were then asked to give a score from 1 to 10 for four items: image quality, accessibility, representativeness, and the overall evaluation.

In addition, we conducted another user study to evaluate the proposed dynamic video thumbnailing. We adopted two items for performance evaluation: informativeness and enjoyability [15, 23]. The participants were asked to give a score (1 to 10) for informativeness and enjoyability, respectively, where a higher score indicates greater satisfaction with the thumbnail. Each test video has three associated thumbnails, consisting of 90 (3 s), 180 (6 s), and 300 (10 s) consecutive frames from the original video, respectively. Note that if the original video is shorter than 300 frames, the full original sequence is used in the 300-frame test.

5.2 Results and discussion

Here, we present detailed experimental results to demonstrate the performance of the proposed method. We first show the static and dynamic thumbnail extraction results in Table 1, reporting the index of the extracted frame with the minimum cost.

Figure 8 shows qualitative comparisons of static thumbnails. We compare our approach with the thumbnails from the first frame and those obtained by Gao et al. [8] and Yong et al. [26]. Note that we used the keyframe selection module for unedited videos in the framework presented in [8].

Fig. 8

Some comparison results. a, e thumbnails from 1st frame, b, f thumbnails obtained by [8], c, g thumbnails obtained by [26], d, h thumbnails obtained by our method

As shown in Fig. 8, the proposed method qualitatively outperforms the existing methods.

Figure 9 shows the results of the subjective preference task. As reported in Fig. 9, the thumbnails generated by our method are generally better than or comparable to others.

Fig. 9

The subjective evaluations of the user preference task

Tables 2 and 3 show the subjective evaluation results for the default thumbnail, Gao's [8] results, Yong's [26] results, and the proposed thumbnailing. In the subjective preference task, the participants tend to focus on the principal role of thumbnailing, which is to extract the most representative frame without the photographer's additional explanation. Therefore, similar distributions appear in Fig. 9 and in the representativeness scores of Table 3. The overall evaluation scores in Table 3 take the image quality, accessibility, and representativeness into account. In particular, the proposed method satisfies the general requirements of a thumbnail, accounting not only for the image quality of the thumbnail but also for the accessibility and representativeness of the video content.

Table 2 Subjective evaluation results of thumbnail image quality and accessibility
Table 3 Subjective evaluation results of thumbnail representativeness and overall evaluation
Table 4 Performance evaluation of dynamic thumbnails according to thumbnail duration

The comparisons of dynamic video thumbnailing results with various thumbnail lengths are reported in Table 4. From the results, we have the following observations.

  • The subjects consider the 90-frame thumbnails more enjoyable than the longer ones, because a 90-frame sequence is considered sufficient to understand or estimate the content of the video.

  • As expected, there is a trade-off between enjoyability and informativeness. However, we believe that the gaps between the average informativeness scores in Table 4 are acceptable for real-world applications.

6 Conclusion

This paper has presented an automatic static and dynamic video thumbnailing method based on content-based scene analysis. The proposed method uses features that account for the image quality and the semantically meaningful representation of the video content. We also assume that steady scenes are more informative and measure scene steadiness using the proposed SBP. Both static and dynamic thumbnails are extracted automatically by selecting the frames with the minimum energy cost. Carefully designed experiments have demonstrated the effectiveness of the proposed method.