1 Introduction

Partial Video Copy Detection (PVCD) finds the segments of a reference video that have transformed copies. It is a well-known topic in the computer vision field [10, 21]. 2D CNN are the main components used to design PVCD systems: these systems extract 2D CNN features from frames for the retrieval and matching of videos. The performance characterization of 2D CNN features is a known topic in the computer vision field; however, it has been little investigated for PVCD.

The contributions of this paper are twofold: (i) based on a public video dataset, we provide large-scale experiments with 700 B comparisons of 4.4 M feature vectors; these experiments yield conclusions on the particular PVCD problem that are consistent with the state of the art in the computer vision field. (ii) we show that the regular protocol for performance characterization is misleading for PVCD, as it is bounded to the video level; for a deeper analysis, we propose a method for the characterization of key-frames. This method applies a goodness criterion and a time-series modelling. It provides a fine categorization of key-frames and allows a deeper characterization of a PVCD problem.

Section 2 reviews the state of the art. Section 3 details our performance characterization work. Conclusions and perspectives are discussed in Sect. 4. Table 1 gives the main symbols and mathematical notations used in the paper.

Table 1. Main symbols and mathematical notations used in the paper

2 Related Work

2D CNN process images through convolutional layers and classify them with fully connected layers. When applied to PVCD, a pipeline embedding the 2D CNN must be defined for video processing (Table 2). A first step is to select key-frames by sampling at a fixed FPS. Key-frames that are close in the temporal domain are redundant. Adaptive methods have been proposed to eliminate redundant frames from their 2D CNN features, using K-means clustering or ranked inter-frame distances [1, 19]; a simplified sketch follows.
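A minimal sketch of one such adaptive scheme, based on thresholding inter-frame similarities (the function name and threshold value are ours, not taken from [1, 19]):

```python
import numpy as np

def select_keyframes(features, sim_threshold=0.95):
    """Greedy redundancy elimination: keep a frame only when its cosine
    similarity to the last kept frame drops below sim_threshold.
    features: (N, D) array of l2-normalized frame features in time order.
    Returns the indices of the kept key-frames."""
    kept = [0]
    for i in range(1, len(features)):
        if features[i] @ features[kept[-1]] < sim_threshold:
            kept.append(i)
    return kept
```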

Table 2. Overview of PVCD systems using 2D CNN

Key-frames are then processed with pre-trained 2D CNN such as AlexNet, VGGNet (16 and 19), ResNet (50, 101 and 152) and InceptionNet. These networks process square input images with sizes \(\in [224, 299]\) pixels in the RGB colour space. They have different architectures and are delivered in different versions (1 to 4).

PVCD systems extract features from the 2D CNN. These features serve for the retrieval and matching of videos. The common approach is to extract the features from the full frames, although a RoI-based extraction can also be applied [8, 22]. The features can be obtained from (i) the Fully Connected (FC) layers or (ii) the convolutional ones. In case (i), the Last FC is commonly used for extraction. With the convolutional layers (ii), standard methods have been established (e.g. MAC and R-MAC [16]) and used in several PVCD systems [8, 22].
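For illustration, a minimal sketch of the MAC descriptor in its usual definition (global max-pooling of a convolutional feature map [16]; R-MAC additionally pools over fixed image regions and sums the regional vectors):

```python
import numpy as np

def mac(conv_map):
    """MAC descriptor: maximum activation of each convolutional channel.
    conv_map: (C, H, W) activations of a convolutional layer.
    Returns a l2-normalized C-dimensional feature vector."""
    f = conv_map.max(axis=(1, 2))
    return f / np.linalg.norm(f)
```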

The videos are then matched from their 2D CNN features. A first approach detects the videos by matching individual frames [13, 15]. The matching can also be made global with a frame-to-frame similarity matrix [1, 4, 6, 8, 15]. In both cases, it is common to apply a \(l_2\) normalization to the features [9, 11, 12, 15] and to match them with the cosine similarity or the Euclidean distance. Low-dimensional approximations can be obtained with pooling [19] or PCA [1, 6, 18].

Robustness of 2D CNN features is a key property for PVCD systems, and their performance characterization is a known topic in the computer vision field. As a general trend, features extracted from recent 2D CNN perform better [5]. The MAC and R-MAC feature extraction methods are better suited to networks with large convolutional layer sizes [2]. The impact of blurring noise has been characterized in [14]. The ability of 2D CNN features to characterize particular images is highlighted in [20].

To the best of our knowledge, comparisons of 2D CNN for PVCD have been addressed only in [11, 12, 15, 17]. The characterization has been done for global matching only, on datasets that are either of limited scale (e.g. SVD [9]) [11, 12, 17] or unbalanced (VCDB [10]) [15, 17]. The fine characterization of 2D CNN features for PVCD has never been investigated.

3 Performance Characterization of 2D CNN Features

PVCD systems extract and match 2D CNN features, which serve for the retrieval and matching of videos. Robustness is a key property of these features; it is a well-known topic in the computer vision field but has been little investigated for PVCD. In this section, we provide large-scale experiments to address this problem. We first introduce the video dataset and the performance characterization protocol. The characterization results are then discussed and their conclusions compared to the state of the art in the computer vision field. Finally, a method for the characterization of key-frames is proposed for a deeper analysis.

3.1 Dataset and Characterization Protocol

For performance characterization, a dataset must be selected. Several main PVCD datasets have been proposed; Table 3 gives a comparison. We have selected the STVD dataset [13]. This dataset has several key properties: (i) it is captured from TV and is almost noise-free, allowing a fine control of degradations with synthetic methods; (ii) it is the largest dataset of the literature, with ten thousand hours of video, 243 references and 1,688 thousand positive pairs; (iii) it offers a balanced distribution between the negative and positive videos; (iv) it is delivered with an accurate timestamping for video alignment.

Table 3. Datasets for PVCD performance evaluation. h, s and N/A stand for hours, seconds and not available.

From the videos and groundtruth of the STVD dataset, we have applied a pipeline to extract 458,750 frames (Table 4). These frames have been sampled from the negative videos and the copied segments, and split into a training and a testing set. We have processed these frames with the 2D CNN VGG-16, ResNet50-v1 and Inception-v1 for the characterization; these networks are typical for PVCD (Table 2). The three common methods Last FC, MAC and R-MAC have been used for extraction, with a \(l_2\) normalization, resulting in 9 databases for a total of 4.1 M feature vectors (of dimensions 512-F, 1,024-F, 2,048-F and 4,096-F).
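As an illustration, a minimal sketch of the Last FC extraction for one frame, assuming a torchvision-style pre-trained VGG-16 (our actual pipeline may differ in its implementation details):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained VGG-16; dropping the final classification layer exposes
# the last FC activations (4,096-D for VGG-16).
model = models.vgg16(weights="IMAGENET1K_V1").eval()
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def last_fc_feature(pil_image):
    x = preprocess(pil_image).unsqueeze(0)  # (1, 3, 224, 224)
    f = model(x).squeeze(0)                 # (4096,)
    return f / f.norm()                     # l2 normalization
```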

Table 4. Dataset for performance characterization

For matching, we have compared the feature vectors with the cosine similarity \(SC(X, Y)\) (for two vectors X and Y). It is a common measure for the matching of CNN features that is time-efficient and robust [3]. With a unit \(l_2\)-norm, it can be obtained with a single dot product \(X \cdot Y\). Considering m and n the sizes of the training and testing sets, the brute-force comparison has a complexity O(mn) (requiring 50.5 B matchings per feature database, 455 B in total). This can be achieved in a few hours with a time-efficient implementation.
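With unit-norm features, the brute-force comparison reduces to a single matrix product; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def brute_force_match(train, test):
    """All-pairs cosine similarities between two l2-normalized feature
    sets of shapes (m, D) and (n, D). Since SC(X, Y) = X . Y for unit
    vectors, the full (m, n) similarity matrix is one matrix product."""
    return train @ test.T
```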

We have applied the characterization protocol of [7, 13, 15] to evaluate the individual performance of the 2D CNN features. All the frames extracted from the copied segments have been labelled with their references in the groundtruth; the negative frames have no label. The performance has been evaluated with the P, R and \(F_1\) scores at the video level: the maximum cosine similarity matters, and at least one detected frame is required to detect a video.
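A minimal sketch of this video-level scoring (array and function names are ours):

```python
import numpy as np

def video_level_scores(max_sims, is_positive, threshold):
    """P, R and F1 at the video level: a video counts as detected when
    its maximum frame-level cosine similarity reaches the threshold,
    i.e. at least one of its frames is matched.
    max_sims: (V,) maximum similarity per video; is_positive: (V,) bool."""
    detected = max_sims >= threshold
    tp = np.sum(detected & is_positive)
    fp = np.sum(detected & ~is_positive)
    fn = np.sum(~detected & is_positive)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```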

3.2 Comparison of 2D CNN Features

Based on the dataset and our protocol, we compare here the accuracy of the 2D CNN features. Figure 1(a) gives the \(F_1\) scores, over a threshold on the cosine similarity, of the different 2D CNN with a common feature extraction method (Last FC). For clarity, the top \(F_1\) scores are also reported in Table 5.

Fig. 1. Comparison of 2D CNN with the Last FC (a) \(F_1\) (b) P/R

Even if strong scores are obtained, full separability is not achieved for the detection. A maximum of \(F_1 \simeq 0.93\) is reached with the ResNet50-v1 network. The different networks present competitive results, with a maximum gap of \(F_1 \simeq 0.03\). These results are consistent with previous comparisons of 2D CNN in the state of the art [5]. For further analysis, Fig. 1(b) provides the P/R plot. All the 2D CNN maintain a strong precision at a high level of recall.

Table 5. Comparison of feature extraction methods with the top \(F_1\) scores

For a comparison of the feature extraction methods, Table 5 gives the top \(F_1\) scores of the different 2D CNN with the Last FC, MAC and R-MAC. For VGG-16, MAC and R-MAC outperform the Last FC method with a slight gap of \(F_1 \simeq 0.03\). However, these methods degrade the performance for ResNet50-v1 and Inception-v1, with a gap of up to \(F_1 \simeq 0.18\). This can be mainly explained by the larger sizes of the convolutional layers in the VGG-16 network compared to ResNet50-v1 and Inception-v1, which lead to more accurate localizations with the MAC and R-MAC features. An equivalent conclusion is reported in [2].

3.3 Characterization of Key-Frames with 2D CNN Features

The selection of 2D CNN features has a performance impact. However, another important aspect is the ability of the video content to be characterized by these features. Indeed, the characterization protocol for PVCD [7, 13, 15] looks for the maximum cosine similarity between video frames, where at least one “good” key-frame is required to detect a video. However, key-frames (Fig. 2) with a high level of noise (a), a near-constant content (b) or an almost-duplicate content (c) can be difficult to detect. A quantitative analysis of the goodness of key-frames must therefore be established, as the regular metrics (P, R and \(F_1\)) are misleading on this task. We investigate this aspect here by providing a characterization protocol of key-frames with 2D CNN features. The goal is to evaluate the accuracy of 2D CNN features when facing a large variability of key-frames for PVCD.

Fig. 2. Examples of key-frames (a) blurred (b) near-constant (c) almost-duplicate (d) foreground/background (e) symmetrical

For the needs of the characterization, we propose the goodness criterion of Eq. (1). This criterion measures the margin between the intra-class and the inter-class similarities. X is the 2D CNN feature of a positive frame and \(\{\tilde{X}_1, \dots , \tilde{X}_m\}\) its corresponding near duplicates. \(\{Y_1, \dots , Y_{n_1}\}\) is the set of negative 2D CNN features and \(\{X_1^*, \dots , X_{n_2}^*\}\) the set of positive ones obtained from the other references. \(SC_{\min }\) and \(SC_{\max }\) are operators returning the minimum and maximum SC between the template X and the feature sets. That is, \(\phi (X)\) is defined \(\in [-1,1]\) and \(\phi (X) > 0\) guarantees separability.

$$\begin{aligned} \phi (X) = SC_{\min }(X, \{ \tilde{X}_1, \dots , \tilde{X}_m \}) - SC_{\max }(X, \{ Y_1, \dots , Y_{n_1} \}, \{ X_1^*, \dots , X_{n_2}^*\} ) \end{aligned}$$
(1)
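A direct NumPy transcription of Eq. (1) for one l2-normalized template (a sketch; variable names are ours):

```python
import numpy as np

def phi(x, near_duplicates, negatives, other_positives):
    """Goodness criterion of Eq. (1): minimum intra-class cosine
    similarity minus maximum inter-class cosine similarity.
    x: (D,) unit template; the three sets are (k, D) unit-vector arrays.
    phi(x) > 0 guarantees that x separates from all non-matching features."""
    sc_min = np.min(near_duplicates @ x)
    sc_max = max(np.max(negatives @ x), np.max(other_positives @ x))
    return sc_min - sc_max
```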

Every frame X and its near-duplicates \(\{\tilde{X}_1, \dots , \tilde{X}_m\}\) are aligned with a timestamp t having a precision of \(\frac{1}{30}\) second (Table 3). The overall set of frames can then be modelled with time series (Fig. 3). In these series, the values \(z_1, \dots , z_{m+1}\) are derived from \(\phi (X)\): for a given frame X at t, we have \(z_1 = \phi (X)\), \(z_2 = \phi (\tilde{X}_1)\), ..., \(z_{m+1} = \phi (\tilde{X}_m)\). These values can be characterized with statistics (the minimum \(z_{\min }\), mean \(\overline{z}\) and maximum \(z_{\max }\) of \(z_1, \dots , z_{m+1}\) and their standard deviation \(\sigma \)) and a rate \(\tau \) accounting for the proportion of positive criterion values.
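The per-timestamp statistics and rate can be computed as follows (a sketch; the dictionary keys are ours):

```python
import numpy as np

def timestamp_stats(z):
    """Statistics of the criterion values z_1, ..., z_{m+1} at one
    timestamp: minimum, mean, maximum, standard deviation sigma, and
    the rate tau of positive criteria."""
    z = np.asarray(z)
    return {"z_min": z.min(), "z_mean": z.mean(), "z_max": z.max(),
            "sigma": z.std(), "tau": float(np.mean(z > 0))}
```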

Fig. 3. Modelling with time series

Table 6. Categorization of frames

From the statistics (\(z_{\min }\), \(\overline{z}\), \(z_{\max }\), \(\sigma \)) and the rates \(\tau \), the frames can be categorized as detailed in Table 6 and illustrated in Fig. 3. The statistics and rates are compared to thresholds \(\alpha \) and \(\beta \) obtained with automatic methods, as detailed thereafter. A large variability between the 2D CNN features of a given frame is detected when an outlier \(\sigma \) value exceeds the threshold \(\alpha \); such frames constitute the set of not consistent frames, labelled NC. The frames where the separability cannot be obtained with the 2D CNN features are characterized by \(z_{\max } < 0\) (hence \(\tau = 0\)); they are labelled NS. From the NS frames, the worst frames, labelled W, can be filtered out with \(z_{\max } < \beta \). The frames where a partial or a full separability can be obtained with the 2D CNN features are characterized by \(\tau \in ]0,1[\) and \(\tau =1\), and labelled PS and FS, respectively.
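These rules translate into a few comparisons (a sketch using the statistics dictionary defined above):

```python
def categorize(stats, alpha, beta):
    """Frame categorization of Table 6 from the per-timestamp statistics:
    NC (not consistent), W (worst), NS (not separable),
    PS (partially separable), FS (fully separable)."""
    if stats["sigma"] > alpha:
        return "NC"               # outlier variability of the features
    if stats["tau"] == 0:         # z_max < 0: no separable value
        return "W" if stats["z_max"] < beta else "NS"
    return "FS" if stats["tau"] == 1 else "PS"
```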

Table 7 reports the categorization results on the training set (Table 3). We have applied the thresholds \(\alpha =0.05\) and \(\beta \in [-0.4, 0]\), obtained with the automatic methods detailed thereafter. For the experiments, we have extended the number of positive frames from 16,200 to 486,000 with a sampling at the full FPS \(=30\). We have used the VGG-16 with the MAC feature extraction method as a tradeoff between a strong detection score (\(F_1 \simeq 0.92\), Table 5) and the memory constraint. With m and n the numbers of positive and negative frames, Eq. (1) has a complexity \(O(m\left( \frac{m+1}{2}\right) + mn)\). This requires \(\simeq 244\) B matchings.

Table 7. Categorization results of the training set at full FPS\(=30\)
Fig. 4. (a) distribution of \(\sigma \) (for \(\alpha \)) (b) time series with \(\tau = 0\) and \(\sigma \le \alpha \) (for \(\beta \))

A total of 50,844 timestamps/indices has been obtained (Table 3). \(\simeq 22 \%\) of the frames have been categorized as not consistent (NC) or worst (W). Within the remaining \(\simeq 78 \%\), only \(\simeq 13 \%\) (of the total) fit the partial (PS) or full (FS) separability. That is, only a very small amount of “good” key-frames, corresponding to the categories PS and FS, appears in the videos. \(\simeq 87 \%\) of the key-frames are hard to detect, their 2D CNN features being not consistent or little discriminant.

The categorization results depend on the applied thresholds \(\alpha =0.05\) and \(\beta \in [-0.4, 0]\). They must be selected carefully; we have fixed them with the automatic methods illustrated in Fig. 4. Figure 4(a) plots the cumulative distribution of \(\sigma \) over the 50,844 indices. The threshold \(\alpha \simeq 0.05\) can be easily obtained with an automatic elbow detection. For clarification, the cumulative rate of indices with \(\tau = 0\) (over all the indices \(\tau \in [0, 1]\)) is given for \(\sigma > \alpha \): \(\ll 1 \%\) of these indices have a \(\tau \ne 0\). The threshold \(\beta \) has been fixed to detect outliers among the indices with \(\tau = 0\) and \(\sigma \le \alpha \), reference per reference. Figure 4(b) illustrates the method. For each reference, a mean \(\overline{\mathcal {Z}}\) over its indices is computed; this mean serves to fix the threshold \(\beta = \overline{\mathcal {Z}}\), and the indices with \(z_{\max } < \overline{\mathcal {Z}}\) are categorized as worst frames W. Considering the 243 references (Table 3), we have obtained a range \(\beta \in [-0.4, 0]\).
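A possible implementation of the two threshold selections (a sketch; this chord-distance elbow detector is one standard variant, given as an assumption rather than our exact method):

```python
import numpy as np

def alpha_by_elbow(sigmas):
    """Elbow of the cumulative distribution of sigma: the point of the
    sorted curve farthest from the chord joining its two endpoints."""
    y = np.sort(np.asarray(sigmas, dtype=float))
    x = np.linspace(0.0, 1.0, len(y))
    chord = np.array([x[-1] - x[0], y[-1] - y[0]])
    chord /= np.linalg.norm(chord)
    d = np.abs((x - x[0]) * chord[1] - (y - y[0]) * chord[0])
    return y[np.argmax(d)]

def beta_for_reference(z_max_values):
    """Per-reference threshold beta: mean over the z_max values of the
    reference's indices with tau = 0 and sigma <= alpha; the indices
    below this mean are labelled W."""
    return float(np.mean(z_max_values))
```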

Figure 2 provides examples of key-frames for the different categories. Figure 2(d, e) gives key-frames labelled FS, containing distinctive shapes (e.g. background/foreground text). They are easy to detect with 2D CNN features [20]. However, they are difficult to catch from the videos, as they constitute only \(\simeq 3\%\) of the total amount of key-frames (Table 7). Figure 2(b, c) gives key-frames with the worst label W, having a near-constant or an altered visual content (e.g. inclusion of logos). Even if they constitute a small part of the key-frames (\(\simeq 8\%\), Table 7), they must be carefully avoided for PVCD. Figure 2(a) shows a key-frame with a high level of blurring, labelled NC. Such key-frames have 2D CNN features with a large variability and a low discriminative power; they are hard to detect [14]. At last, \(\simeq 65 \%\) of the key-frames are categorized as NS. These key-frames cannot be detected efficiently from their 2D CNN features.

4 Conclusions and Perspectives

Based on a large-scale video dataset, this paper gives a performance characterization of 9 common 2D CNN features used for PVCD. The experiments have been driven on 4.4 M feature vectors with 700 B comparisons. Even if strong scores are obtained, with a maximum of \(F_1 \simeq 0.93\), full separability is not achieved on the detection problem. The different networks present competitive results, with a maximum gap of \(F_1 \simeq 0.03\). As a general trend, features extracted from recent 2D CNN such as ResNet50 perform better. A correlation appears between the feature extraction methods and the 2D CNN architectures (e.g. VGG-16 with the MAC and R-MAC features). These conclusions are consistent with the state of the art in the computer vision field.

From the 2D CNN features modelled as time series, a method for the categorization of key-frames has been proposed. This method allows a deeper characterization of a PVCD problem with 2D CNN features. It provides (i) a fine categorization of the key-frames (ii) a characterization of the 2D CNN features in terms of separability and consistency (iii) a quantitative analysis of the goodness of key-frames. It highlights the performance limits of 2D CNN features when facing blurred, near-constant or almost-duplicate key-frames. In addition, a large part of the key-frames (\(\simeq 87 \%\)) cannot be classified efficiently from their 2D CNN features. These limitations will be explored in our future works, by investigating robust key-frame selection and the learning of 2D CNN features to further improve the PVCD performance.