1 Introduction

At present, key frame extraction methods are based on a variety of cues, including shot, light, and motion descriptors. Among them, the most common is lens-based (that is, shot-based) key frame extraction. A video is divided into lenses, and the first frame of each lens (or its first and last frames) is taken as the key frame of that lens. This method is relatively simple and does not depend on the content of the lens, and the number of key frames is fixed (the first frame, the last frame, or both are selected). Its drawback is that it is less stable, because the first and last frames of a lens do not always reflect its main content. Lens-based key frame extraction has been studied mainly in two domains: the pixel domain and the compressed domain.

1.1 Lens-Based Key Frame Extractions in the Pixel Domain

The so-called pixel domain refers to the space/time domain. In contrast to the transform domain, video data in the pixel domain take the form of everyday scenes described by familiar features such as color, texture, shape, and motion vectors. Pixel domain detection uses these features to segment a video sequence into lenses. The key to shot segmentation is to find the differences between the images of different lenses. Several fairly mature key frame extraction methods now exist; they make full use of the temporal/spatial, global/local, and static/dynamic information in the video data. Histogram comparison is the most traditional and common method. In a continuous video sequence without special treatment, the difference between two adjacent frames is small, so the features of adjacent frames are almost the same. Algorithms for comparing the histograms of two frames typically include the Euclidean distance, the chi-square (χ²) test, the dual-threshold comparison method, and the sub-block division method.
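As a minimal sketch of the histogram comparison described above, the following Python code computes a gray-level histogram for each frame and compares two histograms with the Euclidean distance and the chi-square difference. The function names, the 16-bin quantization, and the representation of a frame as a list of pixel rows are assumptions made for illustration; they are not taken from the original method.

```python
from math import sqrt


def gray_histogram(frame, bins=16, max_value=256):
    """Count the pixel intensities of a 2-D frame (list of rows) in equal-width bins."""
    hist = [0] * bins
    width = max_value / bins
    for row in frame:
        for pixel in row:
            hist[min(int(pixel / width), bins - 1)] += 1
    return hist


def euclidean_distance(h1, h2):
    """Euclidean distance between two histograms of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))


def chi_square_distance(h1, h2):
    """Chi-square histogram difference; bin pairs that are both empty are skipped."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


# Two tiny illustrative frames: one half dark and half bright, one fully dark.
frame_a = [[10, 10], [200, 200]]
frame_b = [[10, 10], [10, 10]]
hist_a = gray_histogram(frame_a)
hist_b = gray_histogram(frame_b)
print(euclidean_distance(hist_a, hist_b), chi_square_distance(hist_a, hist_b))
```

A large distance between adjacent frames suggests a lens boundary, while a distance near zero suggests that both frames belong to the same lens.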

1.2 Lens-Based Key Frame Extractions in the Compressed Domain

More and more video data are stored in compressed form, such as JPEG and MPEG-x, so it is necessary to study key frame detection methods for compressed video sequences. Detection is usually carried out in one of two ways:

  1. Full decompression (e.g., Huffman decoding, DPCM decoding, inverse DCT, and motion compensation) reconstructs the video sequence, and a pixel domain approach is then used for key frame extraction. The disadvantage of this method is its heavy computation and low efficiency.

  2. Partial decompression analyzes and processes features taken directly from the compressed video data, which saves decoding time and reduces computational complexity.

Current international standards for image and video compression, such as JPEG, MPEG, H.261, and H.263, are based on the DCT. The DCT converts pixel values in two-dimensional space into coefficient values in the two-dimensional frequency domain; the transform coefficients are closely related to the pixel domain and express the content of the image frame to a certain extent. Arman et al. were early in using DCT coefficients for detection, and this method was later extended to shot segmentation of MPEG compressed streams.
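The link between DCT coefficients and pixel content can be made concrete for the DC term. The sketch below evaluates the 8 × 8 two-dimensional DCT with the JPEG-style normalization and checks that the DC coefficient equals eight times the mean of the block, which is why the DC coefficients alone form a reduced-resolution image. The function names and the synthetic test block are assumptions for illustration, and the level shift applied by real codecs is ignored.

```python
from math import cos, pi, sqrt

N = 8  # block size used by DCT-based codecs


def dct_coefficient(block, u, v):
    """Coefficient (u, v) of the 2-D DCT-II of an N x N block, JPEG-style scaling."""
    cu = 1 / sqrt(2) if u == 0 else 1.0
    cv = 1 / sqrt(2) if v == 0 else 1.0
    total = 0.0
    for x in range(N):
        for y in range(N):
            total += (block[x][y]
                      * cos((2 * x + 1) * u * pi / (2 * N))
                      * cos((2 * y + 1) * v * pi / (2 * N)))
    return 0.25 * cu * cv * total


def block_mean(block):
    """Average pixel value of the block."""
    return sum(sum(row) for row in block) / (N * N)


# Synthetic block with arbitrary but deterministic pixel values.
block = [[(3 * x + 5 * y) % 256 for y in range(N)] for x in range(N)]
dc = dct_coefficient(block, 0, 0)
print(dc, 8 * block_mean(block))  # the two values agree up to rounding
```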

2 The Improved Key Frame Extraction Algorithm Based on the Lens

2.1 Algorithm Basis

The adjacent image frames of a video sequence exhibit similarity and continuity, which is the theoretical basis of lens-based key frame extraction. Yang Sheen et al. constructed a key frame extraction system accordingly.

According to the MPEG-1/2 international standard (video part), an MPEG-1/2 video sequence consists of a number of groups of pictures (GOPs), and each GOP is composed of I-, P-, and B-frames arranged at intervals. The first frame of each group is always an I-frame; an I-frame is coded using only the information of the image itself, whereas P- and B-frames are obtained by prediction. Every lens must contain an I-frame, and this has been confirmed by experiment: MPEG-1/2 video encoding requires an I-frame within every 13 frames, and a lens consists of uninterrupted consecutive frames whose playing time must be on the order of seconds to be meaningful, so at a frame rate of 24 fps each lens necessarily includes I-frames. Therefore, key frame extraction within a lens can discard the P- and B-frames entirely and work on a video sequence composed only of I-frames. In addition, image compression in the MPEG-1/2 standards is based on the DCT, applied to 8 × 8 sub-blocks as the basic unit of transformation. The I-frames can therefore be partially decoded to extract the DC coefficients and restore a DC thumbnail. The template matching method is then adopted, using the difference between thumbnails as the similarity measure between two frames to achieve key frame extraction.
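The two preparatory steps described above, keeping only the I-frames and reducing each of them to a DC thumbnail, can be sketched as follows. This is an illustration under stated assumptions: frame types are given as a ready-made list of labels rather than parsed from a real MPEG stream, and the thumbnail is emulated by averaging 8 × 8 pixel blocks, which is proportional to the DC coefficient that a partial decoder would read directly. The function names are hypothetical.

```python
def select_i_frames(frame_types):
    """Return the indices of the I-frames in a sequence of frame-type labels."""
    return [idx for idx, kind in enumerate(frame_types) if kind == "I"]


def dc_thumbnail(frame, block=8):
    """Emulate a DC thumbnail: one value per block, equal to the block mean."""
    rows, cols = len(frame) // block, len(frame[0]) // block
    thumbnail = []
    for br in range(rows):
        thumb_row = []
        for bc in range(cols):
            total = sum(frame[br * block + r][bc * block + c]
                        for r in range(block) for c in range(block))
            thumb_row.append(total / (block * block))
        thumbnail.append(thumb_row)
    return thumbnail


# Two 13-frame GOPs (an assumed frame pattern), each starting with an I-frame.
frame_types = list("IBBPBBPBBPBBP" * 2)
print(select_i_frames(frame_types))

# A 16 x 16 frame whose top half is brighter than its bottom half.
frame = [[100 if r < 8 else 50 for _ in range(16)] for r in range(16)]
print(dc_thumbnail(frame))
```

For a 16 × 16 frame the thumbnail has only 2 × 2 entries, which illustrates how much less data the later comparison step has to process.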

2.2 Algorithm Thinking

Analysis shows that two image frames within the same lens are statistically very similar, whereas two frames that belong to different lenses have very little similarity. The solution starts from the sub-block: the sub-blocks in the middle of each image frame depict the visual information at the core of the scene, so, compared with the sub-blocks at surrounding locations in the same frame, the information in the center sub-blocks is more important [1]. As a result, the difference between sub-blocks at the center position plays an especially important role in determining the difference between two adjacent image frames and should be treated specially [2]. On the basis of the literature, this article introduces a weighted difference of the DC coefficients and proposes an improved lens-based key frame extraction algorithm in the compressed domain [3]. Following the idea that the sub-block difference at the middle of the image frame has more reference value than the sub-block difference at the periphery, a schematic diagram of the weighted difference of the DC coefficients is constructed (Fig. 69.1).

Fig. 69.1 Schematic diagram of the weighted difference of the 8 × 8 sub-block DC coefficients
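One possible reading of the center-weighting idea is sketched below: a weight map gives sub-blocks in a central core region a larger weight than peripheral ones, and the difference between two DC thumbnails is accumulated with these weights. The extent of the core region (`core_fraction`), the weight value (`core_weight`), the use of absolute differences, and the function names are all hypothetical choices for illustration; the text does not specify them.

```python
def center_weights(rows, cols, core_fraction=0.5, core_weight=2.0):
    """Weight map: core_weight inside the central region, 1.0 elsewhere."""
    r0 = int(rows * (1 - core_fraction) / 2)
    c0 = int(cols * (1 - core_fraction) / 2)
    r1, c1 = rows - r0, cols - c0
    return [[core_weight if r0 <= r < r1 and c0 <= c < c1 else 1.0
             for c in range(cols)]
            for r in range(rows)]


def weighted_difference(thumb_a, thumb_b, weights):
    """Sum of weighted absolute differences between two DC thumbnails."""
    return sum(w * abs(a - b)
               for row_a, row_b, row_w in zip(thumb_a, thumb_b, weights)
               for a, b, w in zip(row_a, row_b, row_w))


weights = center_weights(4, 4)
zeros = [[0.0] * 4 for _ in range(4)]
corner = [row[:] for row in zeros]
corner[0][0] = 10.0   # change at the periphery
center = [row[:] for row in zeros]
center[1][1] = 10.0   # same change, but in the core region
print(weighted_difference(zeros, corner, weights),
      weighted_difference(zeros, center, weights))
```

With these assumed weights, the same change counts twice as much when it occurs in the core region, which amplifies differences near the center of the frame.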

In Fig. 69.1, each sub-block corresponds to one DC coefficient, and the depth of the color shows the importance of the sub-block information at each location. The deeper the color, the more important the position in the current image frame, and the larger the weight that must be given to the corresponding DC coefficient in the similarity measure. Finally, the difference between thumbnails, used as the similarity measure of two adjacent image frames, is calculated as follows:

$$ D(I_{i}, I_{i+1}) = \sum\limits_{k=0}^{n} \frac{\left[ H_{i}(k) - H_{i+1}(k) \right]^{2}}{\left[ H_{i}(k) + H_{i+1}(k) \right]^{2}} $$

Here, \( I_{i} \) and \( I_{i+1} \) denote the i-th and (i + 1)-th I-frames, respectively, and \( H_{i} \) and \( H_{i+1} \) denote the histograms of their DC thumbnails. When \( D(I_{i}, I_{i+1}) \) reaches a peak, the two I-frames are identified as belonging to different lenses, and the first frame of the new lens is extracted as a key frame. In essence, the proposed algorithm amplifies the sub-block differences within the image frame.
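The difference measure and the decision rule can be sketched as follows. The function `thumbnail_difference` follows the formula above term by term, with both the numerator and the denominator squared. Two simplifications are assumptions of this sketch rather than part of the described method: bins that are empty in both histograms are skipped to avoid division by zero, and a peak is approximated by a fixed threshold. The first I-frame is also assumed to open the first lens.

```python
def thumbnail_difference(h_i, h_next):
    """D(I_i, I_{i+1}): sum over bins of (H_i - H_{i+1})^2 / (H_i + H_{i+1})^2."""
    total = 0.0
    for a, b in zip(h_i, h_next):
        if a + b > 0:  # assumption: skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b) ** 2
    return total


def detect_key_frames(histograms, threshold):
    """Indices of I-frames taken as key frames (first frame of each lens)."""
    keys = [0]  # assumption: the first I-frame starts the first lens
    for i in range(len(histograms) - 1):
        if thumbnail_difference(histograms[i], histograms[i + 1]) > threshold:
            keys.append(i + 1)
    return keys


# Histograms of four I-frame thumbnails; the content changes after the second.
histograms = [[4, 0, 0], [4, 0, 0], [0, 4, 0], [0, 4, 0]]
print(detect_key_frames(histograms, threshold=1.0))
```

In practice the threshold would have to be tuned per video type, which is one of the parameter choices discussed in the experiments.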

3 The Analysis of Experiments and Results

For the improved algorithm, the key lies in the selection of the characteristic parameters and of the decision rules that determine the key frames. Specifically, this means how to select the scope of the so-called core area, how to determine the weight values of the DC coefficients within the core region, and how to select the frame difference threshold for extracting key frames. These choices directly affect the performance of the proposed algorithm.

3.1 Experimental Performances

Recall and precision were adopted to measure the improved algorithm in this article. In retrieval, each of the integrated query, description, matching, and extraction steps may succeed or fail. According to the principles of pattern recognition, this gives the four conditions in Table 69.1, which correspond to four basic parameters.

Table 69.1 Basic parameters for expressing retrieval ability

The basic parameters in Table 69.1 can be used to define the commonly used recall rate and precision, which characterize retrieval performance. They are defined as follows:

  • Recall rate = (correctly retrieved relevant results) / (all relevant results) = [A/(A + C)] × 100 %

  • Precision = (correctly retrieved relevant results) / (all retrieved results) = [A/(A + B)] × 100 %
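The two definitions above translate directly into code. In this sketch, A is taken to be the number of relevant results that were retrieved, B the number of retrieved results that are not relevant, and C the number of relevant results that were missed; this reading is inferred from the formulas. The counts in the example are made up for illustration and are not experimental data.

```python
def recall_rate(a, c):
    """Recall in percent: A / (A + C) x 100; 0.0 if nothing is relevant."""
    return 100.0 * a / (a + c) if a + c > 0 else 0.0


def precision(a, b):
    """Precision in percent: A / (A + B) x 100; 0.0 if nothing was retrieved."""
    return 100.0 * a / (a + b) if a + b > 0 else 0.0


# Illustrative counts only: 18 correct detections, 2 false detections, 2 misses.
a, b, c = 18, 2, 2
print(recall_rate(a, c), precision(a, b))
```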

Animation, film, advertising, science, education, and other video clips were selected to test the effectiveness of the key frame extraction algorithm designed in this paper. The properties of the test video clips are shown in Table 69.2.

Table 69.2 Properties of the test video clips

3.2 Analysis of Experimental Data

The template matching method, the Euclidean distance method, the sub-block division method, the I-frame DC coefficient method, and the improved algorithm were applied in turn to key frame extraction in compressed video sequences, and their retrieval results were recorded. T denotes the detection time (unit: s), K the number of key frames (unit: frames), and L the length of the video sequence (unit: frames).

The experimental results show that, after full decoding, key frame extraction in the pixel domain is slightly better in precision than the compressed domain methods. Full decoding yields rich and complete image information, and among the pixel domain methods the sub-block division method performs best.

However, large-scale decoding of the compressed file lengthens detection time and makes real-time detection less effective. The compressed domain methods, such as the I-frame DC coefficient method and the improved algorithm, have shorter detection times than the pixel domain methods, but because partially decoded image information is limited, they are slightly worse in precision. In the experiments, the compressed domain methods performed well in recall rate, and their recall percentage per unit time is higher than that of the pixel domain detection methods.

The data in Tables 69.3, 69.4, and 69.5 show that the improved algorithm in this article improves retrieval time and recall rate compared with the traditional sub-block division detection method. It increases sensitivity to image motion at the center of the frame within a lens, which makes it more suitable for video sequences with intense local motion, such as news, documentaries, and films; no key frame in a lens is missed, but there is a small amount of redundancy.

Table 69.3 Test results for animation (L = 530, K = 20)
Table 69.4 Test results for film (L = 820, K = 22)
Table 69.5 Test results for advertisement (L = 410, K = 20)

4 Conclusions

This paper puts forward an improved lens-based key frame extraction algorithm in the compressed domain. Considering the encoding characteristics of compressed video sequences, and in accordance with the idea that the sub-block difference in the middle of the image frame is more valuable than that at the periphery, the proposed algorithm uses only the DC component of the I-frame information. Experimental results show that the key frames extracted by the proposed algorithm better reflect the contents of the video lens.