1 Introduction

Partial Video Copy Detection (PVCD) finds the segments of a reference video that have transformed copies. It is a well-known topic in the computer vision field [10, 21]. 2D CNN are the main components used to design PVCD systems: these systems extract 2D CNN features from frames for the retrieval and matching of videos. The performance characterization of 2D CNN features is a known topic in the computer vision field; however, it has been little investigated for PVCD.

The contributions of this paper are twofold: (i) based on a public video dataset, we provide large-scale experiments with 700 B comparisons of 4.4 M feature vectors; these experiments yield conclusions on the particular PVCD problem that are consistent with the state of the art in the computer vision field. (ii) we show that the regular protocol for performance characterization is misleading for PVCD, as it is bounded to the video level; for a deeper analysis, we propose a method for the characterization of key-frames. This method applies a goodness criterion and a time-series modelling. It provides a fine categorization of key-frames and allows a deeper characterization of a PVCD problem.

Section 2 reviews the state of the art. Section 3 details our performance characterization work. Conclusions and perspectives are discussed in Sect. 4. Table 1 gives the main symbols and mathematical notations used in the paper.

Table 1. Main symbols and mathematical notations used in the paper

2 Related Work

2D CNN process images through convolutional layers and classify them with fully connected layers. When applied to PVCD, a pipeline embedding the 2D CNN must be defined for video processing (Table 2). A first step is to select key-frames by sampling at a fixed FPS. Key-frames that are close in the temporal domain are redundant. Adaptive methods have been proposed to eliminate redundant frames from their 2D CNN features, using K-means clustering or ranked inter-frame distances [1, 19]; a simplified sketch follows.
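A minimal sketch of one such adaptive scheme, based on thresholding inter-frame similarities (the function name and threshold value are ours, not taken from [1, 19]):

```python
import numpy as np

def select_keyframes(features, sim_threshold=0.95):
    """Greedy redundancy elimination: keep a frame only when its cosine
    similarity to the last kept frame drops below sim_threshold.
    features: (N, D) array of l2-normalized frame features in time order.
    Returns the indices of the kept key-frames."""
    kept = [0]
    for i in range(1, len(features)):
        if features[i] @ features[kept[-1]] < sim_threshold:
            kept.append(i)
    return kept
```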

Table 2. Overview of PVCD systems using 2D CNN

Key-frames are then processed with pre-trained 2D CNN such as AlexNet, VGGNet (16 and 19), ResNet (50, 101 and 152) and InceptionNet. These networks process square input images with sizes \(\in [224, 299]\) pixels in the RGB colour space. They have different architectures and are delivered in different versions (1 to 4).

PVCD systems extract features from the 2D CNN. These features serve for the retrieval and matching of videos. The common approach is to extract the features from the full frames, although a RoI-based extraction can also be applied [8, 22]. The features can be obtained from (i) the Fully Connected (FC) layers or (ii) the convolutional ones. In case (i), the Last FC is commonly used for extraction. With the convolutional layers (ii), standard methods have been established (e.g. MAC and R-MAC [16]) and used in several PVCD systems [8, 22].
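For illustration, a minimal sketch of the MAC descriptor in its usual definition (global max-pooling of a convolutional feature map [16]; R-MAC additionally pools over fixed image regions and sums the regional vectors):

```python
import numpy as np

def mac(conv_map):
    """MAC descriptor: maximum activation of each convolutional channel.
    conv_map: (C, H, W) activations of a convolutional layer.
    Returns a l2-normalized C-dimensional feature vector."""
    f = conv_map.max(axis=(1, 2))
    return f / np.linalg.norm(f)
```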

The videos are then matched from their 2D CNN features. A first approach detects the videos by matching individual frames [13, 15]. The matching can also be made global with a frame-to-frame similarity matrix [1, 4, 6, 8, 15]. In both cases, it is common to apply a \(l_2\) normalization to the features [9, 11, 12, 15] and to match them with the cosine similarity or the Euclidean distance. Low-dimensional approximations can be obtained with pooling [19] or PCA [1, 6, 18].

Robustness of 2D CNN features is a key property for PVCD systems, and their performance characterization is a known topic in the computer vision field. As a general trend, features extracted from recent 2D CNN perform better [5]. The MAC and R-MAC feature extraction methods are better suited to networks with large convolutional layer sizes [2]. The impact of blurring noise has been characterized in [14]. The ability of 2D CNN features to characterize particular images is highlighted in [20].

To the best of our knowledge, comparisons of 2D CNN for PVCD have been addressed only in [11, 12, 15, 17]. The characterization has been done for global matching only, on datasets that are either of limited scale (e.g. SVD [9]) [11, 12, 17] or unbalanced (VCDB [10]) [15, 17]. The fine characterization of 2D CNN features for PVCD has never been investigated.

3 Performance Characterization of 2D CNN Features

PVCD systems extract and match 2D CNN features, which serve for the retrieval and matching of videos. Robustness is a key property of these features; it is a well-known topic in the computer vision field but has been little investigated for PVCD. In this section, we provide large-scale experiments to address this problem. We first introduce the video dataset and the performance characterization protocol. The characterization results are then discussed and their conclusions compared to the state of the art in the computer vision field. Finally, a method for the characterization of key-frames is proposed for a deeper analysis.

3.1 Dataset and Characterization Protocol

For performance characterization, a dataset must be selected. Several main PVCD datasets have been proposed; Table 3 gives a comparison. We have selected the STVD dataset [13]. This dataset has several key properties: (i) it is captured from TV and is almost noise-free, allowing a fine control of degradations with synthetic methods; (ii) it is the largest dataset of the literature, with ten thousand hours of video, 243 references and 1,688 thousand positive pairs; (iii) it offers a balanced distribution between the negative and positive videos; (iv) it is delivered with an accurate timestamping for video alignment.

Table 3. Datasets for PVCD performance evaluation. h, s and N/A stand for hours, seconds and not available.

From the videos and groundtruth of the STVD dataset, we have applied a pipeline to extract 458,750 frames (Table 4). These frames have been sampled from the negative videos and the copied segments, and split into a training and a testing set. We have processed these frames with the 2D CNN VGG-16, ResNet50-v1 and Inception-v1 for the characterization; these networks are typical for PVCD (Table 2). The three common methods Last FC, MAC and R-MAC have been used for extraction, with a \(l_2\) normalization, resulting in 9 databases for a total of 4.1 M feature vectors (of dimensions 512-F, 1,024-F, 2,048-F and 4,096-F).
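As an illustration, a minimal sketch of the Last FC extraction for one frame, assuming a torchvision-style pre-trained VGG-16 (our actual pipeline may differ in its implementation details):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained VGG-16; dropping the final classification layer exposes
# the last FC activations (4,096-D for VGG-16).
model = models.vgg16(weights="IMAGENET1K_V1").eval()
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def last_fc_feature(pil_image):
    x = preprocess(pil_image).unsqueeze(0)  # (1, 3, 224, 224)
    f = model(x).squeeze(0)                 # (4096,)
    return f / f.norm()                     # l2 normalization
```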

Table 4. Dataset for performance characterization

For matching, we have compared the feature vectors with the cosine similarity \(SC(X, Y)\) (for two vectors X and Y). It is a common measure for the matching of CNN features that is time-efficient and robust [3]. With a unit \(l_2\)-norm, it can be obtained with a single dot product \(X \cdot Y\). Considering m and n the sizes of the training and testing sets, the brute-force comparison has a complexity O(mn) (requiring 50.5 B matchings per feature database, 455 B in total). This can be achieved in a few hours with a time-efficient implementation.
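With unit-norm features, the brute-force comparison reduces to a single matrix product; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def brute_force_match(train, test):
    """All-pairs cosine similarities between two l2-normalized feature
    sets of shapes (m, D) and (n, D). Since SC(X, Y) = X . Y for unit
    vectors, the full (m, n) similarity matrix is one matrix product."""
    return train @ test.T
```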

We have applied the characterization protocol of [7, 13, 15] to evaluate the individual performance of the 2D CNN features. All the frames extracted from the copied segments have been labelled with their references in the groundtruth; the negative frames have no label. The performance has been evaluated with the P, R and \(F_1\) scores at the video level: the maximum cosine similarity matters, and at least one detected frame is required to detect a video.
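A minimal sketch of this video-level scoring (array and function names are ours):

```python
import numpy as np

def video_level_scores(max_sims, is_positive, threshold):
    """P, R and F1 at the video level: a video counts as detected when
    its maximum frame-level cosine similarity reaches the threshold,
    i.e. at least one of its frames is matched.
    max_sims: (V,) maximum similarity per video; is_positive: (V,) bool."""
    detected = max_sims >= threshold
    tp = np.sum(detected & is_positive)
    fp = np.sum(detected & ~is_positive)
    fn = np.sum(~detected & is_positive)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```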

3.2 Comparison of 2D CNN Features

Based on the dataset and our protocol, we compare here the accuracy of the 2D CNN features. Figure 1(a) gives the \(F_1\) scores, over a threshold on the cosine similarity, of the different 2D CNN with a common feature extraction method (Last FC). For clarity, the top \(F_1\) scores are also reported in Table 5.

Fig. 1. Comparison of 2D CNN with the Last FC (a) \(F_1\) (b) P/R

Even if strong scores are obtained, full separability is not achieved for the detection. A maximum of \(F_1 \simeq 0.93\) is reached with the ResNet50-v1 network. The different networks present competitive results, with a maximum gap of \(F_1 \simeq 0.03\). These results are consistent with previous comparisons of 2D CNN in the state of the art [5]. For further analysis, Fig. 1(b) provides the P/R plot. All the 2D CNN maintain a strong precision at a high level of recall.

Table 5. Comparison of feature extraction methods with the top \(F_1\) scores

For a comparison of the feature extraction methods, Table 5 gives the top \(F_1\) scores of the different 2D CNN with the Last FC, MAC and R-MAC. For VGG-16, MAC and R-MAC outperform the Last FC method with a slight gap of \(F_1 \simeq 0.03\). However, these methods degrade the performance for ResNet50-v1 and Inception-v1, with a gap of up to \(F_1 \simeq 0.18\). This can be mainly explained by the larger sizes of the convolutional layers in the VGG-16 network compared to ResNet50-v1 and Inception-v1, which lead to more accurate localizations with the MAC and R-MAC features. An equivalent conclusion is reported in [2].

3.3 Characterization of Key-Frames with 2D CNN Features

The selection of 2D CNN features has a performance impact. However, another important aspect is the ability of the video content to be characterized by these features. Indeed, the characterization protocol for PVCD [7, 13, 15] looks for the maximum cosine similarity between video frames, where at least one “good” key-frame is required to detect a video. However, key-frames (Fig. 2) with a high level of noise (a), a near-constant content (b) or an almost-duplicate content (c) can be difficult to detect. A quantitative analysis of the goodness of key-frames must therefore be established, as the regular metrics (P, R and \(F_1\)) are misleading on this task. We investigate this aspect here by providing a characterization protocol of key-frames with 2D CNN features. The goal is to evaluate the accuracy of 2D CNN features when facing a large variability of key-frames for PVCD.

Fig. 2. Examples of key-frames (a) blurred (b) near-constant (c) almost-duplicate (d) foreground/background (e) symmetrical

For the needs of the characterization, we propose the goodness criterion of Eq. (1). This criterion measures the margin between the intra-class and the inter-class similarities. X is the 2D CNN feature of a positive frame and \(\{\tilde{X}_1, \dots , \tilde{X}_m\}\) its corresponding near duplicates. \(\{Y_1, \dots , Y_{n_1}\}\) is the set of negative 2D CNN features and \(\{X_1^*, \dots , X_{n_2}^*\}\) the set of positive ones obtained from the other references. \(SC_{\min }\) and \(SC_{\max }\) are operators returning the minimum and maximum SC between the template X and the feature sets. That is, \(\phi (X)\) is defined \(\in [-1,1]\) and \(\phi (X) > 0\) guarantees separability.

$$\begin{aligned} \phi (X) = SC_{\min }(X, \{ \tilde{X}_1, \dots , \tilde{X}_m \}) - SC_{\max }(X, \{ Y_1, \dots , Y_{n_1} \}, \{ X_1^*, \dots , X_{n_2}^*\} ) \end{aligned}$$
(1)
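A direct NumPy transcription of Eq. (1) for one l2-normalized template (a sketch; variable names are ours):

```python
import numpy as np

def phi(x, near_duplicates, negatives, other_positives):
    """Goodness criterion of Eq. (1): minimum intra-class cosine
    similarity minus maximum inter-class cosine similarity.
    x: (D,) unit template; the three sets are (k, D) unit-vector arrays.
    phi(x) > 0 guarantees that x separates from all non-matching features."""
    sc_min = np.min(near_duplicates @ x)
    sc_max = max(np.max(negatives @ x), np.max(other_positives @ x))
    return sc_min - sc_max
```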

Every frame X and its near-duplicates \(\{\tilde{X}_1, \dots , \tilde{X}_m\}\) are aligned with a timestamp t having a precision of \(\frac{1}{30}\) second (Table 3). The overall set of frames can then be modelled with time series (Fig. 3). In these series, the values \(z_1, \dots , z_{m+1}\) are derived from \(\phi (X)\): for a given frame X at t, we have \(z_1 = \phi (X)\), \(z_2 = \phi (\tilde{X}_1)\), ..., \(z_{m+1} = \phi (\tilde{X}_m)\). These values can be characterized with statistics (the minimum \(z_{\min }\), mean \(\overline{z}\) and maximum \(z_{\max }\) of \(z_1, \dots , z_{m+1}\) and their standard deviation \(\sigma \)) and a rate \(\tau \) accounting for the proportion of positive criterion values.
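The per-timestamp statistics and rate can be computed as follows (a sketch; the dictionary keys are ours):

```python
import numpy as np

def timestamp_stats(z):
    """Statistics of the criterion values z_1, ..., z_{m+1} at one
    timestamp: minimum, mean, maximum, standard deviation sigma, and
    the rate tau of positive criteria."""
    z = np.asarray(z)
    return {"z_min": z.min(), "z_mean": z.mean(), "z_max": z.max(),
            "sigma": z.std(), "tau": float(np.mean(z > 0))}
```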

Fig. 3. Modelling with time series

Table 6. Categorization of frames

From the statistics (\(z_{\min }\), \(\overline{z}\), \(z_{\max }\), \(\sigma \)) and the rates \(\tau \), the frames can be categorized as detailed in Table 6 and illustrated in Fig. 3. The statistics and rates are compared to thresholds \(\alpha \) and \(\beta \) obtained with automatic methods, as detailed thereafter. A large variability between the 2D CNN features of a given frame is detected when an outlier \(\sigma \) value exceeds the threshold \(\alpha \); such frames constitute the set of not consistent frames, labelled NC. The frames where the separability cannot be obtained with the 2D CNN features are characterized by \(z_{\max } < 0\) (hence \(\tau = 0\)); they are labelled NS. From the NS frames, the worst frames, labelled W, can be filtered out with \(z_{\max } < \beta \). The frames where a partial or a full separability can be obtained with the 2D CNN features are characterized by \(\tau \in ]0,1[\) and \(\tau =1\), and labelled PS and FS, respectively.
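These rules translate into a few comparisons (a sketch using the statistics dictionary defined above):

```python
def categorize(stats, alpha, beta):
    """Frame categorization of Table 6 from the per-timestamp statistics:
    NC (not consistent), W (worst), NS (not separable),
    PS (partially separable), FS (fully separable)."""
    if stats["sigma"] > alpha:
        return "NC"               # outlier variability of the features
    if stats["tau"] == 0:         # z_max < 0: no separable value
        return "W" if stats["z_max"] < beta else "NS"
    return "FS" if stats["tau"] == 1 else "PS"
```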

Table 7 reports the categorization results on the training set (Table 3). We have applied the thresholds \(\alpha =0.05\) and \(\beta \in [-0.4, 0]\), obtained with the automatic methods detailed thereafter. For the experiments, we have extended the number of positive frames from 16,200 to 486,000 with a sampling at the full FPS \(=30\). We have used the VGG-16 with the MAC feature extraction method as a tradeoff between a strong detection score (\(F_1 \simeq 0.92\), Table 5) and the memory constraint. With m and n the numbers of positive and negative frames, Eq. (1) has a complexity \(O(m\left( \frac{m+1}{2}\right) + mn)\). This requires \(\simeq 244\) B matchings.

Table 7. Categorization results of the training set at full FPS\(=30\)
Fig. 4. (a) distribution of \(\sigma \) (for \(\alpha \)) (b) time series with \(\tau = 0\) and \(\sigma \le \alpha \) (for \(\beta \))

A total of 50,844 timestamps/indices has been obtained (Table 3). \(\simeq 22 \%\) of the frames have been categorized as not consistent (NC) or worst (W). Within the remaining \(\simeq 78 \%\), only \(\simeq 13 \%\) (of the total) fit the partial (PS) or full (FS) separability. That is, only a very small amount of “good” key-frames, corresponding to the categories PS and FS, appears in the videos. \(\simeq 87 \%\) of the key-frames are hard to detect, their 2D CNN features being not consistent or little discriminant.

The categorization results depend on the applied thresholds \(\alpha =0.05\) and \(\beta \in [-0.4, 0]\). They must be selected carefully; we have fixed them with the automatic methods illustrated in Fig. 4. Figure 4(a) plots the cumulative distribution of \(\sigma \) over the 50,844 indices. The threshold \(\alpha \simeq 0.05\) can be easily obtained with an automatic elbow detection. For clarification, the cumulative rate of indices with \(\tau = 0\) (over all the indices \(\tau \in [0, 1]\)) is given for \(\sigma > \alpha \): \(\ll 1 \%\) of these indices have a \(\tau \ne 0\). The threshold \(\beta \) has been fixed to detect outliers among the indices with \(\tau = 0\) and \(\sigma \le \alpha \), reference per reference. Figure 4(b) illustrates the method. For each reference, a mean \(\overline{\mathcal {Z}}\) over its indices is computed; this mean serves to fix the threshold \(\beta = \overline{\mathcal {Z}}\), and the indices with \(z_{\max } < \overline{\mathcal {Z}}\) are categorized as worst frames W. Considering the 243 references (Table 3), we have obtained a range \(\beta \in [-0.4, 0]\).
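A possible implementation of the two threshold selections (a sketch; this chord-distance elbow detector is one standard variant, given as an assumption rather than our exact method):

```python
import numpy as np

def alpha_by_elbow(sigmas):
    """Elbow of the cumulative distribution of sigma: the point of the
    sorted curve farthest from the chord joining its two endpoints."""
    y = np.sort(np.asarray(sigmas, dtype=float))
    x = np.linspace(0.0, 1.0, len(y))
    chord = np.array([x[-1] - x[0], y[-1] - y[0]])
    chord /= np.linalg.norm(chord)
    d = np.abs((x - x[0]) * chord[1] - (y - y[0]) * chord[0])
    return y[np.argmax(d)]

def beta_for_reference(z_max_values):
    """Per-reference threshold beta: mean over the z_max values of the
    reference's indices with tau = 0 and sigma <= alpha; the indices
    below this mean are labelled W."""
    return float(np.mean(z_max_values))
```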

Figure 2 provides examples of key-frames for the different categories. Figure 2(d, e) gives key-frames labelled FS, containing distinctive shapes (e.g. background/foreground text). They are easy to detect with 2D CNN features [20]. However, they are difficult to catch from the videos, as they constitute only \(\simeq 3\%\) of the total amount of key-frames (Table 7). Figure 2(b, c) gives key-frames with the worst label W, having a near-constant or an altered visual content (e.g. inclusion of logos). Even if they constitute a small part of the key-frames (\(\simeq 8\%\), Table 7), they must be carefully avoided for PVCD. Figure 2(a) shows a key-frame with a high level of blurring, labelled NC. Such key-frames have 2D CNN features with a large variability and a low discriminative power; they are hard to detect [14]. At last, \(\simeq 65 \%\) of the key-frames are categorized as NS. These key-frames cannot be detected efficiently from their 2D CNN features.

4 Conclusions and Perspectives

Based on a large-scale video dataset, this paper gives a performance characterization of 9 common 2D CNN features used for PVCD. The experiments have been driven on 4.4 M feature vectors with 700 B comparisons. Even if strong scores are obtained, with a maximum of \(F_1 \simeq 0.93\), full separability is not achieved on the detection problem. The different networks present competitive results, with a maximum gap of \(F_1 \simeq 0.03\). As a general trend, features extracted from recent 2D CNN such as ResNet50 perform better. A correlation appears between the feature extraction methods and the 2D CNN architectures (e.g. VGG-16 with the MAC and R-MAC features). These conclusions are consistent with the state of the art in the computer vision field.

From the 2D CNN features modelled as time series, a method for the categorization of key-frames has been proposed. This method allows a deeper characterization of a PVCD problem with 2D CNN features. It provides (i) a fine categorization of the key-frames (ii) a characterization of the 2D CNN features in terms of separability and consistency (iii) a quantitative analysis of the goodness of key-frames. It highlights the performance limits of 2D CNN features when facing blurred, near-constant or almost-duplicate key-frames. In addition, a large part of the key-frames (\(\simeq 87 \%\)) cannot be classified efficiently from their 2D CNN features. These limitations will be explored in our future works, by investigating robust key-frame selection and the learning of 2D CNN features to further improve the PVCD performance.