1 Introduction

Video sharing and publishing on the Internet are growing tremendously with the rapid advance of multimedia technologies. Protecting original video content has therefore become a difficult, high-stakes challenge for content owners, distributors and publishers. Detecting and administering the enormous number of videos uploaded every day to video sharing Web sites such as YouTube, Netflix, etc., is a critical problem for the owners of commercial video Web servers. Copyrighted digital information can be replicated and arbitrarily distributed by an adversary without the consent of the copyright holder. An adversary can also manipulate the original video content by applying content-preserving distortions (e.g., lossy compression, contrast enhancement) and geometric distortions (e.g., rotation, scaling). Protection and management of such highly sensitive digital information have thus become a critical task. A copy of a video is a manipulated or transformed video sequence that is more or less similar, but not identical, to the source video [1]. Advances in video navigation technology make it easy for users to locate any sequence of a TV show, such as its opening sequence. Moreover, with video editing software such as Final Cut Pro 7 and iMovie, users can alter the content of a video by combining or editing similar versions of the same video, in which the quality of the video may be degraded or improved. Detecting a near-duplicate copy of an edited version of a video whose quality has been improved is particularly hard. Devising a fast and robust method for accurately detecting illegal copies or manipulated versions of an original video therefore remains a challenging task. Robust video copy detection is also required in many other real-time application areas, such as detection of duplicate Web videos [2] and monitoring of real-time TV commercials [3, 4] across multi-broadcast channels; such monitoring still involves manual work, and its real-time performance is very poor. It is indispensable to employ a copy detection scheme that is both discriminative and robust against various distortions such as picture-in-picture, region cropping, scaling, etc., which remains a challenging research field.

Numerous studies have addressed these copyright issues. Watermarking-based copy detection [5] is used extensively. In a watermarking scheme, extra information is imperceptibly embedded into the original media content before it is distributed [6] and can be extracted later to obtain information about the original video content and to link a copy back to its original. The embedded information travels with the media content throughout the distribution process and can be used to detect illegal distribution of the content. However, the watermarking scheme has some demerits: (1) content without a watermark, such as legacy content that has already been distributed, cannot be traced or detected through watermarking; (2) even the minor alteration induced by the watermark degrades the quality of the content, which is unacceptable for some applications, such as detection of digital content involving medical images; (3) there is a trade-off between imperceptibility and robustness. The watermark should be robust to diverse transformations of the digital content [7], which is still not adequate for copy detection. Conventional cryptographic hash functions [8, 9] are also used for digital signature authentication, in which a message is uniquely identified by a short, fixed-length bit vector. Since a cryptographic hash function operates on the whole message, it cannot identify or check the integrity of only a part of the message. In addition, the hash value produced by a cryptographic hash function changes substantially when the input message changes by even a single bit [10].

Considering all these pitfalls, robust visual hashing (also called digital fingerprinting) was introduced for the digital rights management of multimedia data [11]. It is an alternative approach for copy detection and, unlike watermarking, avoids any embedding operation. Perceptual hashing is widely used for content-based image retrieval, image indexing and image authentication [12]. It was later adopted for video copy detection [13, 14], where the fingerprint, called the hash value, is extracted by analyzing the signal of a video sequence. This value permits unambiguous identification of the signal (much as a fingerprint identifies a person). The foremost objective of visual hashing-based copy detection is to extract a compact, fixed-length feature or hash code from video segments (a key frame or the whole frame set of a video scene) in order to identify and differentiate the original video from a manipulated one, for copyright management and for tracking and organizing giant video databases. To preserve the desired properties of visual hashing, namely (1) uniqueness, (2) compactness and (3) robustness, a short robust hash code [15] is extracted from video segments and matched using a distance metric to identify pirated content. Coskun et al. [16] proposed a visual hashing scheme that considers both the spatial and temporal domains, since video frames carry motion information across time, and that is robust against certain content-preserving (e.g., contrast enhancement) and geometric (e.g., frame dropping) distortions. Several survey papers [17, 18] have also elucidated the essence of various video copy detection schemes. Figure 1 gives a visual example of a near-duplicate copy of an original video frame.

Fig. 1 Visual example of near-duplicate copy of an original video

The rest of this paper is organized as follows: Sect. 2 presents a detailed state of the art on visual hashing-based video copy detection. Sections 3 and 4 analyze the major challenges and discuss current trends, respectively. The conclusion is presented in Sect. 5.

2 State of the art

Various copy detection methods have been proposed for solving piracy issues and managing huge video databases. Visual hashing- or fingerprinting-based copy detection is preferable to watermarking-based copy detection because of its high discriminability and its robustness against various distortions. A visual hashing-based method extracts a compact hash code or fingerprint that can tell whether a suspicious piece of content matches a multimedia document registered in the fingerprint database. Moreover, unlike the watermarking approach, hashing or fingerprinting can be applied to the legacy content (content that has already been distributed) of a digital medium [14, 15]. Figure 2 gives an overview of the working principle of hashing- or fingerprinting-based video copy detection.

Fig. 2 Flowchart of hashing- or fingerprinting-based video copy detection approach

In this section, various existing visual hashing-based video copy detection methods are discussed. The methods are classified according to the domain from which the hash codes (digital fingerprints) or feature vectors are extracted, i.e., the spatial, temporal and spatial–temporal domains.

2.1 Based on spatial domain

In the spatial domain, the hash code or feature vector is extracted from each key frame [19] or from every frame [20] of a video. Spatial features play an important role in video copy detection and identification, since they can locate salient points either locally or globally within the spatial space and are robust against common video processing steps [21] such as lossy compression, resizing, frame rate change, etc., as well as against geometric attacks (e.g., scaling, rotation) [19]. However, identifying key frames that represent the video efficiently is an important issue in the spatial domain [19], and such methods require large memory space when operating over vast video databases [22]. Temporal information such as time and frame differences [23], which is a salient property of video, is not considered in the spatial domain.

Further, the methods can be classified, according to the features extracted in the spatial domain, into those based on local features [24], global features [25], coarse features [26] and combined local–global features [27].

2.1.1 Methods based on local features

Here, local descriptors such as local interest points [24, 28] of the frame image are extracted to form a compact hash code vector. Local features exhibit high discriminative capability; however, they are less sensitive to global changes [29]. Neelima and Singh [19] introduced a scale-invariant feature transform (SIFT)-based local feature descriptor that is invariant to scaling, rotation and translation applied to the video frame sequence. The invariant key points were first extracted with the SIFT descriptor from each key frame selected from the video frame sequence and then clustered into 32 clusters. Based on the cluster centroids, thirty-two distinct pixel blocks of size m × n were generated. Finally, the maximum singular value of each block, obtained by singular value decomposition (SVD), was taken as the feature vector. The SIFT descriptor is only partially invariant to illumination changes and affine transformations, which is its drawback.
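A minimal sketch of this keypoint–cluster–SVD pipeline is given below. It only illustrates the idea described above; the block size, cluster count and all function names are illustrative assumptions, not the exact settings of [19].

```python
# Sketch of a SIFT + clustering + SVD hash for one key frame, loosely
# following the pipeline described above; block size, cluster count and
# all helper names are illustrative assumptions.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def keyframe_hash(gray, n_clusters=32, block=(16, 16)):
    sift = cv2.SIFT_create()
    kps, _ = sift.detectAndCompute(gray, None)           # invariant key points
    pts = np.float32([kp.pt for kp in kps])               # (x, y) locations
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(pts)
    h, w = gray.shape
    feats = []
    for cx, cy in km.cluster_centers_:                    # one block per centroid
        x0 = int(np.clip(cx - block[1] // 2, 0, w - block[1]))
        y0 = int(np.clip(cy - block[0] // 2, 0, h - block[0]))
        patch = gray[y0:y0 + block[0], x0:x0 + block[1]].astype(np.float32)
        s = np.linalg.svd(patch, compute_uv=False)         # singular values
        feats.append(s[0])                                  # keep the largest one
    return np.array(feats)                                  # 32-D feature vector

# frame = cv2.imread("keyframe.png", cv2.IMREAD_GRAYSCALE)
# print(keyframe_hash(frame))
```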

The interest points were detected using the difference of Gaussians (DoG) method in [24], where the Gaussian kernel serves as the scale-space kernel. It detects repeatable key points whose pixel locations remain detectable even after geometric attacks such as scale change, rotation, etc. The authors used a cascade of subsampled images (multi-resolution) and filters, called octaves. The difference of Gaussians is a good approximation of the scale-normalized Laplacian of Gaussian at the scales kσ and σ:

$$ G\left( {u,v,k\sigma } \right) - G\left( {u,v,\sigma } \right) \approx \left( {k - 1} \right)\sigma^{2} \nabla^{2} G. $$
(1)

The DoG image was obtained by convolving (1) with the frame image as follows:

$$ D\left( {u,v,\sigma } \right) = (G\left( {u,v,k\sigma } \right) - G\left( {u,v,\sigma } \right)) * I\left( {u,v} \right), $$
(2)

where G(u, v, σ) is the Gaussian-smoothed image and I(u, v) is the frame image at pixel coordinate (u, v). In each octave, the authors used an initial scale factor σ of 1.6 and a multiplicative factor k of 1.15. The locations of the interest points are given by the extrema of the DoG. Only points with precise spatial localization and good contrast were kept [30]. To reduce the storage space and enhance the performance, the interest points were extracted from key frames [31]. To compute the local descriptor characterizing each key point, a circular neighborhood of radius R (e.g., 20 pixels), which is invariant to rotation, was applied, and the key point orientation was computed by summing the gradient vectors in a small disk around the key point. The neighboring disk was divided into nine regions using this orientation, and a local histogram of sixteen bins was computed to represent the local descriptor. However, this method is not robust against global changes such as color variation and incurs a high computational cost.
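A rough sketch of the DoG detection of Eqs. (1)–(2) follows: the frame is blurred at scales σ and kσ, the difference image is formed, and local maxima with sufficient contrast are kept. The neighbourhood test and the contrast threshold are assumptions made for illustration.

```python
# Rough DoG interest-point sketch following Eqs. (1)-(2); the 3x3 extremum
# test and the contrast threshold are assumptions, not the settings of [24].
import cv2
import numpy as np

def dog_points(gray, sigma=1.6, k=1.15, contrast_thr=8.0):
    g1 = cv2.GaussianBlur(gray.astype(np.float32), (0, 0), sigma)
    g2 = cv2.GaussianBlur(gray.astype(np.float32), (0, 0), k * sigma)
    dog = g2 - g1                                    # D(u, v, sigma) of Eq. (2)
    # a pixel is kept if it is the maximum of its 3x3 neighbourhood
    dil = cv2.dilate(dog, np.ones((3, 3), np.uint8))
    peaks = (dog == dil) & (np.abs(dog) > contrast_thr)
    ys, xs = np.nonzero(peaks)
    return list(zip(xs, ys))                          # (u, v) interest points
```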

The method proposed in [28] works in a similar way to [24], the only difference being the use of an enhanced version of the Harris detector [32], based on a generalized random transform, that is invariant to rotation and scale changes. For each interest point, a local description of the region of interest was computed, and a distance metric fusing geometric information and intensity was used to compare key frames extracted by a scene detection algorithm. The geometry of the interest points was captured as follows.

Let pi = (xi, yi) be the ith key point in the frame image; a weighted average of the separation vectors ξij = pi − pj was calculated for each point pi as follows:

$$ \xi_{i}^{m} = \frac{1}{k - 1}\sum\limits_{\begin{subarray}{l} j = 1 \\ j \ne i \end{subarray} }^{k} {\omega \left( {\left| {\xi_{ij} } \right|} \right)\xi_{ij} } , $$
(3)

where k is the total number of interest points in the frame image and ω(·) is a monotonically decreasing function on \( \Re^{ + } \). However, the experimental results show that the method is not invariant to rotation or to global variations.
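Equation (3) can be transcribed directly; in the sketch below the decreasing weight ω(·) is taken as a simple exponential decay, which is an assumption since [28] does not fix its form here.

```python
# Direct transcription of Eq. (3): for each interest point p_i, average the
# separation vectors to all other points, weighted by a monotonically
# decreasing function of their length (exp(-d/tau) is an assumed choice).
import numpy as np

def geometry_descriptor(points, tau=50.0):
    P = np.asarray(points, dtype=np.float64)        # k interest points, shape (k, 2)
    k = len(P)
    xi = P[:, None, :] - P[None, :, :]              # xi_ij = p_i - p_j
    dist = np.linalg.norm(xi, axis=2)               # |xi_ij|
    w = np.exp(-dist / tau)                         # decreasing weight omega(.)
    np.fill_diagonal(w, 0.0)                        # exclude j == i
    return (w[:, :, None] * xi).sum(axis=1) / (k - 1)   # one vector per point
```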

Li et al. [33] also proposed extracting interest points from regions of interest (ROIs) using a method called FREAK (Fast Retina Keypoint), which is robust against scaling, rotation and noise. Initially, ROIs were extracted using thresholding and a morphological merging technique. FREAK points were then extracted from each ROI and normalized to reduce the effects of slightly inaccurate ROI extraction, as given below:

$$ {\text{nf}} = F\left( {\text{Glf}} \right), $$
(4)

where Glf represents the feature vector of the current frame, F(·) is a normalization function, and nf is the normalized feature vector. Subsequently, the FREAK points were clustered by spectral clustering to reduce the redundancy of features within the same shot and were used as fingerprints. To achieve rotation invariance, Zhang et al. [34] introduced a method based on speeded up robust features (SURF), which extracts local features from the frame image and identifies a reproducible orientation for each interest point. In addition, locality-sensitive hashing (LSH) was applied for indexing, generating a compact hash code by projecting similar hash codes into the same hash bucket, which speeds up copy detection. The drawbacks of this method are its high computational cost and its low sensitivity to global changes such as color variation. Similarly, the SURF method [34] was also used in [35], together with the oriented FAST and rotated BRIEF (ORB) method [36], to detect pirated video content. This method cannot handle illumination changes.
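The detect-then-index idea can be sketched as follows. ORB stands in for SURF here (SURF lives in the opencv-contrib package), and a random-hyperplane LSH maps each descriptor to a short bucket key; the bit length, feature count and the bucket-overlap comparison are all assumptions for illustration.

```python
# Hedged sketch of local-feature extraction followed by LSH bucketing;
# ORB is used as a stand-in for SURF, and all parameters are assumptions.
import cv2
import numpy as np

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 32))               # 16-bit LSH over 32-D ORB bytes

def lsh_buckets(gray):
    orb = cv2.ORB_create(nfeatures=500)
    _, desc = orb.detectAndCompute(gray, None)        # desc: (n, 32) uint8 or None
    if desc is None:
        return set()
    bits = (desc.astype(np.float32) @ planes.T) > 0   # sign of random projections
    keys = np.packbits(bits, axis=1)                  # 2-byte bucket key per point
    return {tuple(k) for k in keys}

# Similar frames should share many bucket keys:
# overlap = len(lsh_buckets(frame_a) & lsh_buckets(frame_b))
```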

2.1.2 Methods based on global features

The global features of a video sequence were extracted using the centroid of gradient orientations (CGO) descriptor proposed in [21]. In this method, each resized frame was partitioned into a grid of m × n blocks and the CGO was calculated for each block, resulting in an (mn)-dimensional feature vector. The proposed method is pair-wise independent (unrelated video segments produce independent fingerprints) and robust against certain distortions such as lossy compression, frame rate change, etc. However, global feature descriptors are insensitive to local changes, which leads to a discriminability issue [29]. Moreover, the method is not robust against general geometric transformations such as rotation, shift and cropping, and needs to be enhanced.
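A minimal sketch of a CGO-style block descriptor is shown below. The per-block value is taken here as the gradient-magnitude-weighted mean orientation, which is one possible reading of "centroid of gradient orientations"; the exact computation in [21] may differ, and the grid and resize dimensions are assumptions.

```python
# CGO-style global descriptor sketch: resize the frame, split it into an
# m x n grid and take one gradient-orientation statistic per block; the
# weighted-mean-orientation choice and all sizes are assumptions.
import cv2
import numpy as np

def cgo_descriptor(gray, m=4, n=4, size=(128, 128)):
    g = cv2.resize(gray, size).astype(np.float32)
    gx = cv2.Sobel(g, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(g, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy)                 # gradient magnitude, orientation
    bh, bw = size[1] // m, size[0] // n
    feat = []
    for i in range(m):
        for j in range(n):
            w = mag[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            a = ang[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            feat.append((w * a).sum() / (w.sum() + 1e-9))   # weighted centroid
    return np.array(feat)                               # (m*n)-dimensional vector
```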

For computing the similarity between images, the ordinal measure (OM) was first introduced in [25] and later extended to video copy detection in [37]. The ordinal measure (a global feature) is computed from the N blocks of each frame image by ranking the blocks according to their average gray level. The OM M(t) of the tth frame is represented as:

$$ M\left( t \right) = \left( {R_{0} ,R_{1} , \ldots ,R_{N - 1} } \right), $$
(5)

where Ri is the rank of the ith block. Generally, the ordinal measure is robust against transformations applied to the whole frame, such as noise, filtering and recompression. However, it cannot survive local transformations such as logo insertion, cropping and shifting. The method proposed by Yang and Li [38] works in a similar way to [21]: visual features are extracted from each m × n block of each frame using the gradient orientations of the luminance centroid. The method is robust against common video processing steps, but not against geometric distortions.
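The ordinal measure of Eq. (5) is simple to sketch: compute the mean gray level of each block and keep only the block ranks. The 3 × 3 grid below is an assumed value.

```python
# Ordinal measure of Eq. (5): split the frame into N blocks, take each
# block's mean gray level and keep only the rank of each block.
import cv2
import numpy as np

def ordinal_measure(gray, grid=(3, 3)):
    # INTER_AREA resizing to grid size approximates per-block averaging
    small = cv2.resize(gray, grid[::-1], interpolation=cv2.INTER_AREA)
    means = small.astype(np.float32).ravel()          # average gray level per block
    ranks = means.argsort().argsort()                  # R_0 ... R_{N-1}
    return ranks

# Two frames can then be compared by a distance between their rank vectors,
# e.g. np.abs(ordinal_measure(a) - ordinal_measure(b)).sum().
```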

A copy detection scheme based on the quadrant of the luminance centroid was introduced by Uchida et al. [39], in which each frame is divided into 4 × 4 blocks bi (1 ≤ i ≤ 16) and the coordinates of the luminance centroid \( \left( {x_{i}^{\prime } ,y_{i}^{\prime } } \right) \) are calculated for each block as follows:

$$ x_{i}^{'} = \frac{{\sum\nolimits_{{\left( {x,y} \right) \in b_{i} }} {x \cdot I\left( {x,y} \right)} }}{{\sum\nolimits_{{\left( {x,y} \right) \in b_{i} }} {I\left( {x,y} \right)} }},\quad y_{i}^{'} = \frac{{\sum\nolimits_{{\left( {x,y} \right) \in b_{i} }} {y \cdot I\left( {x,y} \right)} }}{{\sum\nolimits_{{\left( {x,y} \right) \in b_{i} }} {I\left( {x,y} \right)} }}, $$
(6)

where I(x, y) is the luminance of the frame image at coordinate (x, y). Subsequently, the block-level luminance centroids were binarized into a 32-bit quadrant feature, and stable key frames were selected to enhance the pair-wise independence between unrelated video segments. Finally, the stable features were compared using an adaptive mask. However, this method is not invariant to strong local variations such as cropping, frame shift, etc.
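A sketch of Eq. (6) plus the quadrant binarization is given below: each of the 16 blocks yields a luminance centroid, and two bits record which quadrant of the block it falls in, giving a 32-bit feature per frame. The exact bit encoding in [39] is not specified here, so the packing below is an assumption.

```python
# Per-block luminance centroid (Eq. (6)) binarized to a quadrant code;
# the 2-bit-per-block packing is an assumed encoding.
import numpy as np

def quadrant_feature(gray):
    h, w = gray.shape
    bits = []
    for i in range(4):
        for j in range(4):
            blk = gray[i*h//4:(i+1)*h//4, j*w//4:(j+1)*w//4].astype(np.float64)
            ys, xs = np.mgrid[0:blk.shape[0], 0:blk.shape[1]]
            s = blk.sum() + 1e-9
            cx, cy = (xs * blk).sum() / s, (ys * blk).sum() / s   # Eq. (6)
            bits += [cx >= blk.shape[1] / 2, cy >= blk.shape[0] / 2]
    return np.packbits(np.array(bits, dtype=np.uint8))            # 32 bits = 4 bytes
```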

To improve the robustness against rotation and flipping attacks of the work proposed in [40], Himeur and Sadi [41] combined the binarized statistical image features (BSIF) local texture descriptor with a local color descriptor, using weighting parameters, to obtain a global descriptor. A BSIF histogram was computed from all the rings of each BSIF frame image, and a hue histogram of every decomposed patch of each frame was computed from the corresponding RGB values of each pixel. This method is not robust against other transformations such as cropping, pattern insertion, etc. The same authors later proposed a new approach [42] to enhance the performance, in which the ring decomposition-based BSIF method of [41] was combined with an invariant color descriptor (ICD). The ICD was applied to the video frames to construct a color description that is robust against geometric attacks such as rotation and flipping. The method has low discriminative capability, which needs to be improved further.

2.1.3 Methods based on coarse features

In this category of methods, coarse features are extracted to represent the video content, such as features obtained by nonnegative matrix factorization from each key frame [31], attention regions represented by a saliency map [43, 44], discrete cosine transform coefficients [26, 45, 46], mean luminance comparisons between two adjacent subregions of a ring [47] and contourlet coefficients [48]. The detection accuracy is limited, since coarse features can only capture an approximate representation of the video content [49]. The authors in [31] proposed a scheme based on nonnegative matrix factorization (NMF) that extracts perceptual fingerprints from each key frame via Gaussian weighting. Transform-invariant NMF (T-NMF)-based video indexes were integrated into the scheme to ensure robustness and compactness against geometric attacks and global luminance changes. However, video copy detection based only on key frames cannot yield temporal localization [23], and the method can incur a high computational cost.

In [43], a coarse representation of the feature vector was extracted from the visual attention regions represented by a saliency map. A unique saliency map was formed by combining normalized visual feature maps (color, intensity and orientation maps) computed from the input frame, as given below:

$$ S = \frac{1}{m}\sum\limits_{i = 1}^{m} {N\left( {X_{i} } \right)} , $$
(7)

where N(·) is a normalization function, Xi represents a feature map, and S is the combined map. The saliency map was then partitioned into a grid of m rows and n columns, resulting in m × n blocks, and the average saliency value of each block was calculated. Finally, this coarse representation of the saliency map was adaptively quantized to a binary vector, which serves as the proposed video feature vector or fingerprint. A bottom-up approach was used, which avoids influence from the top level of the human visual system (HVS). This method is robust against content-preserving distortions, but not against geometric distortions. The method introduced in [44] is based on a similar concept to [43], except that a self-information-based method is used to create the visual saliency map. The saliency of a location is quantified by the self-information of an m × n local image patch centered on that location, and a salient covariance matrix (SCM) descriptor is then introduced as a robust and compact feature descriptor for video copy detection. The high computational cost and limited discriminative capability are the main disadvantages of this method.

The discrete cosine transform (DCT), used in [26, 45] to extract the DC coefficient of the luminance component in each block of a frame, was later adopted with a slight improvement by the authors in [46]. They used the color layout descriptor (CLD), a robust and compact frame-based descriptor that captures the frequency content of a highly coarse representation of the frame image. The CLD feature was obtained by averaging the frame image down to an 8 × 8 image in each YCbCr channel. The DCT was computed for each of these channel images, and the DC coefficient together with the first five AC coefficients (in zigzag scan order) of each channel was selected to form an 18-dimensional CLD feature vector, which was further encoded by vector quantization (VQ). However, significant gamma variation and cropping can distort the CLD enough to cause errors, which is the main disadvantage of this method.
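A hedged CLD-style sketch follows: the frame is averaged down to 8 × 8 per YCbCr channel, a 2-D DCT is taken, and the DC plus the first five AC coefficients in zigzag order are kept per channel (18 values). The hard-coded zigzag list and the omission of the VQ step are simplifications.

```python
# CLD-style 18-D feature sketch; the zigzag list is hardcoded and the
# vector quantization step described above is omitted for brevity.
import cv2
import numpy as np
from scipy.fft import dctn

ZIGZAG6 = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]   # DC + first 5 AC

def cld_feature(bgr):
    ycc = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    feat = []
    for c in range(3):                                        # Y, Cr, Cb channels
        tiny = cv2.resize(ycc[:, :, c], (8, 8), interpolation=cv2.INTER_AREA)
        coeffs = dctn(tiny.astype(np.float32), norm='ortho')  # 2-D DCT
        feat += [coeffs[i, j] for i, j in ZIGZAG6]
    return np.array(feat)                                      # 18-D CLD vector
```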

A texture descriptor called the region binary pattern (RBP) was proposed by Kim et al. [47]. The method extracts two complementary region binary patterns from the subregions of several rings of a key frame to preserve the spatial structure and is robust against rotation and flipping. A key frame is divided into several rings, and each ring is further divided into subregions from which the RBPs are extracted. The first (intra-type) RBP represents a binary pattern within a single ring, and the second (inter-type) RBP is computed from the relationship between adjacent rings by comparing the mean luminance of subregions of a ring. However, the spatial distribution is not fully considered, so the method is not invariant to global changes and other transformations such as frame dropping, logo insertion, etc.

The contourlet transform hidden Markov tree (CHMT) model was proposed by Sun et al. [48]. In this method, each resized frame of a video is partitioned into a grid of m × n blocks. Each block is then transformed into contourlet coefficients, and the standard deviation matrices of the CHMT model are extracted as intermediate features. Finally, SVD [19] is applied to reduce the dimension of the standard deviation matrices, and the largest singular value of each matrix is taken as the feature vector. With few parameters, the CHMT model can capture all inter-scale, inter-direction and inter-location dependencies of the contourlet coefficients and is robust against common content-preserving operations such as lossy compression and filtering, but not against geometric attacks (e.g., frame dropping, rotation).

2.1.4 Methods based on local and global features

In this class of methods, both local and global features are extracted from the spatial domain of the video frames in order to preserve both the robustness and the discriminability of the feature descriptors, which are the main issues faced when using only local features and only global features, respectively. To meet both properties, the authors in [27] introduced a method that uses similarity-preserving multimodal hash learning (SPM2H) to generate a compact hash code. In this scheme, SIFT [19] and the pyramid histogram of oriented gradients (PHOG) were used to extract the local and global features, respectively, from each key frame, and SPM2H was then applied to combine both features into a low-dimensional compact hash code with good accuracy. The method used in [20] was also used by Ding and Nie [29] for copy detection, with a slight difference. In their method, interest points were extracted with the SURF [34] local feature descriptor from each key frame of the video to reduce the computational cost. Each key frame was then divided into equal-area circular rings around a center point determined from the interest points and the key frame boundaries. Each ring was further divided into equal-area sectors, and from each sector the ordinal measure (global feature) [37] was computed and taken as the fingerprint. The authors in [50] adopted the SIFT [27] descriptor for local feature extraction to estimate the copy transformation, and the ordinal measure [37] was then used as a global feature to accelerate the copy detection. The random sample consensus (RANSAC) algorithm was used to estimate the affine transformation that maps points in the query frame to those in its reference frame, and mismatched local feature points were removed by the same algorithm (a sketch of this step follows below). Chiu et al. [51] adopted the same method as in [50], together with a segment-based similarity matching technique, for copy detection. Computational complexity is the main demerit of these methods.
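The RANSAC step mentioned above can be sketched with OpenCV's robust affine estimator: matched local feature points from a query and a reference frame are fed in, and the inlier mask discards mismatched points. The function name `estimate_copy_transform` and the reprojection threshold are assumptions for illustration, not details of [50].

```python
# RANSAC-based affine estimation between matched query and reference points;
# a sketch, not the exact procedure of [50].
import cv2
import numpy as np

def estimate_copy_transform(query_pts, ref_pts):
    # query_pts, ref_pts: float32 arrays of matched (x, y) coordinates, shape (N, 2)
    A, inliers = cv2.estimateAffine2D(query_pts, ref_pts,
                                      method=cv2.RANSAC,
                                      ransacReprojThreshold=3.0)
    return A, inliers.ravel().astype(bool)            # 2x3 affine + inlier mask
```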

To detect copy–move forgery (CMF), two texture descriptors, cellular automata (CA) and the local binary pattern (LBP), were introduced by Tralic et al. [52]. The core idea is that a CA learns a set of rules for every overlapping block of each frame that appropriately describes the intensity changes within the block; a histogram of these rules is then created and used as the feature vector for forgery detection. A binary representation of the feature vector was obtained by applying LBP locally to every neighborhood of each block, which leads to a remarkable reduction in the number of possible rules before the histogram is created. Because the method considers the whole frame, it cannot detect the case where part of a frame is copied and pasted into a different frame of the same sequence. In the method proposed in [53], the ORB [36] descriptor was used to extract a local binary feature vector from each key frame, whereas a color correlation histogram and a key frame thumbnail were introduced to extract the global feature vector for copy detection. To select a matched video, the similarity of the corresponding feature vectors was evaluated in an intuitive voting system that requires at least two matched feature vectors. ORB performs well at low cost compared to SURF and SIFT [36]. However, the ORB descriptor shows less resistance to image distortion, illumination changes and changes in scale.

2.2 Based on temporal domain

Methods based on the temporal domain extract visual features or hash values from consecutive video frames along the temporal direction [34]. Since videos typically carry motion-based features across time, the temporal information intrinsically linked to video frames must be considered to fully capture the frame-level representation. Temporal localization is important for locating actions precisely in time, even when the surrounding frames are visually similar [23]. Using only temporal information for video copy detection has some demerits: (1) it cannot be applied to short-duration video segments, as it is feasible only for long videos, and (2) it is therefore not suitable for online applications involving short videos [22]. The method proposed in [1] extracts a global descriptor in the temporal domain and uses it as the fingerprint. In this method, the feature value of the tth frame is calculated as a weighted sum of per-pixel squared differences between frames t and t − 1, as given below:

$$ V\left( t \right) = \sum\limits_{i = 0}^{N - 1} {B\left( i \right)} \left( {I\left( {i,t} \right) - I\left( {i,t - 1} \right)} \right)^{2} , $$
(8)

where B(i) is a weight function that increases the significance of the central pixels, N is the number of pixels per frame, and I(i, t) (i = 0, 1, …, N − 1) is the intensity of the ith pixel of the tth frame. The fingerprint was computed around the frame with maximum temporal activity V(t), and spectral analysis by FFT yields a 16-dimensional vector based on the phase of the temporal activity. This method uses only the content relation in the temporal domain and is not robust against local distortions such as region cropping, frame insertion, etc.
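Equation (8) translates directly into code; the centre-weighting B is taken here as a Gaussian window, which is an assumption since [1] only requires it to emphasize central pixels.

```python
# Per-frame temporal activity V(t) of Eq. (8); the Gaussian centre
# weighting B is an assumed choice.
import numpy as np

def temporal_activity(frames):                  # frames: (T, H, W) grayscale array
    t, h, w = frames.shape
    yy, xx = np.mgrid[0:h, 0:w]
    sigma = 0.25 * min(h, w)                     # assumed spread of the weight B
    B = np.exp(-((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * sigma ** 2))
    diff = np.diff(frames.astype(np.float64), axis=0) ** 2   # (I(i,t)-I(i,t-1))^2
    return (diff * B).sum(axis=(1, 2))           # V(t) for t = 1 .. T-1
```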

The ordinal measure [37], used as a global feature descriptor in the spatial domain, has been extended to the temporal domain [54] by ranking the regions (blocks) along the temporal axis. If each frame is divided into N blocks and λn is the ordinal measure of region n in a temporal window of length T, the distance D between a reference video Vrf and a query video Vq at time t is given as follows:

$$ D\left( {V_{\text{q}} ,V_{\text{rf}}^{\text{s}} } \right) = \frac{1}{N}\sum\limits_{n = 1}^{N} d \left( {\lambda_{\text{q}}^{\text{n}} ,\lambda_{\text{rf}}^{{{\text{s}},{\text{n}}}} } \right), $$
(9)

where

$$ d\left( {\lambda_{\text{q}}^{\text{n}} ,\lambda_{\text{rf}}^{\text{s,n}} } \right) = \frac{1}{{C_{T} }}\sum\limits_{i = 1}^{T} {\left| {\lambda_{\text{q}}^{\text{n}} \left( i \right) - \lambda_{\text{rf}}^{\text{s,n}} \left( {s + i - 1} \right)} \right|} . $$
(10)

Here, s is the tested temporal shift and CT is a normalizing factor. The temporal shift s that minimizes the distance was selected. This measure is robust against certain transformations such as time shifting and recompression, but cannot tolerate transformations that change a subset of the frames in the video clip, such as frequent region cropping or insertion of a large area.
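Equations (9)–(10) can be sketched as follows; the rank sequences of query and reference are stored per block, and the normalizing factor CT is taken as T, which is an assumption.

```python
# Temporal ordinal distance of Eqs. (9)-(10): lam_q and lam_rf hold the
# per-block ordinal ranks over time (shape: blocks x frames); C_T is
# assumed to equal T.
import numpy as np

def temporal_ordinal_distance(lam_q, lam_rf, s, T):
    d = np.abs(lam_q[:, :T] - lam_rf[:, s:s + T]).mean(axis=1)   # Eq. (10), per block
    return d.mean()                                               # Eq. (9), over N blocks

# The best alignment scans all feasible shifts:
# s_best = min(range(lam_rf.shape[1] - T + 1),
#              key=lambda s: temporal_ordinal_distance(lam_q, lam_rf, s, T))
```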

Radhakrishnan and Bauer [55] introduced a subspace projection scheme for extracting fingerprints from the group of frames in each time interval of the video. First, the basis vectors of a coarse representation are generated using SVD [19]. Then, a subspace representation of the input video frames is obtained by projecting the coarse representation of the video frames onto a subset of the basis vectors. Finally, the fingerprint is generated by projecting a temporal average of these representations onto pseudo-random basis vectors. The temporal average Ta of \( (R_{0}^{s} ,R_{1}^{s} , \ldots ,R_{M - 1}^{s} ) \) is computed as given below:

$$ T_{\text{a}} \left( z \right) = \frac{1}{M}\sum\limits_{i = 0}^{M - 1} {R_{i}^{s} } \left( z \right),\quad z = 0,1, \ldots ,M - 1, $$
(11)

where \( R_{i}^{s} \left( \cdot \right) \) is the coarse representation of the video frames. The top Z values of Ta are selected for each time interval Tt. This method is not robust to certain transformations such as illumination changes, region cropping, etc.

In [56], the authors extract a video signature, or compact hash value, based on the temporal variation or shot change positions of the video file. Anchor frames that represent the video's temporal structure (the signature) were extracted using the cumulative luminance histogram difference (CLHD) and statistics collected in a local window, together with an adaptive threshold, after temporal subsampling of the video frames. An efficient suffix array data structure was then applied to achieve fast signature matching. The method does not work well for video content with many gradual transition effects and much object movement.

Similarly, the motion vectors along the temporal direction of a video were extracted using a combination of the mean of the magnitudes of motion vectors (MMMV) and the mean of the phase angles of motion vectors (MPMV) in the method proposed in [57]. This method does not produce precise results when the motion vectors are extracted from consecutive frames captured at a high frame rate. To solve the problems faced by methods based on key frames or frame-by-frame comparison, Wang et al. [58] proposed expressing the temporal context of key frames as binary codes. The frames surrounding each key frame were clustered into two groups based on their temporal relationships with the central key frame, which were then used to generate a binary code representing the temporal context of that key frame. Before matching, the key frames were first projected into distinct buckets by the locality-sensitive hashing [34] technique, and in the sequence matching stage the distance between the temporal context binary codes (TCB) of key frames falling in the same bucket was computed using the Hamming distance metric. The difficulty of this method lies in finding a robust key frame that uniquely represents the video sequence.

2.3 Based on spatial–temporal domain

Features extracted from the spatial domain and from the temporal domain each play a crucial role in video copy detection, but methods that rely on only one of these domains suffer from many shortcomings. To overcome them, several methods have been proposed that consider both the spatial and temporal information of videos to achieve better performance. Taking these shortcomings and challenges into account, Coskun et al. [16] proposed transforming the luminance component of a video sequence by a 3-dimensional discrete cosine transform (3D-DCT) or a 3-dimensional random bases transform (3D-RBT). The low-pass transform coefficients are ordered and quantized using the median of the rank-ordered coefficients, generating 4 × 4 × 4 binary bits for each 3D cube. This method can resist some temporal transformations such as frame rate change or frame dropping and is robust against certain spatial transformations such as recompression and contrast change, but it cannot tolerate manipulations that destroy the spatial and temporal structure, such as picture-in-picture and frame insertion.
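A minimal 3D-DCT hash in the spirit of [16] is sketched below: the clip is resampled to a small luminance cube, the 3-D DCT is taken, a 4 × 4 × 4 corner of low-pass coefficients is kept and binarized against its median. The cube size and the temporal resampling by index selection are assumptions.

```python
# 3D-DCT cube hash sketch; cube size and resampling strategy are assumptions.
import cv2
import numpy as np
from scipy.fft import dctn

def cube_hash(frames, cube=32):                  # frames: list of grayscale images
    idx = np.linspace(0, len(frames) - 1, cube).astype(int)    # temporal resampling
    vol = np.stack([cv2.resize(frames[i], (cube, cube)) for i in idx]).astype(np.float32)
    coeffs = dctn(vol, norm='ortho')[:4, :4, :4] # low-pass 4x4x4 corner
    return (coeffs > np.median(coeffs)).astype(np.uint8).ravel()   # 64 binary bits
```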

A new approach called the temporally informative representative image (TIRI) was introduced in [59,60,61] for copy detection; a TIRI represents a short segment of the video and contains spatial–temporal information about that segment. The pixels of the TIRI for each video segment are generated as a weighted sum of the frames as follows:

$$ I_{u,v} = \sum\limits_{i = 1}^{K} {\alpha_{i} } l_{u,v,i} , $$
(12)

where lu,v,i is the luminance value of the \( (u,v) \)th pixel of the ith frame in a segment of K frames and αi is the weight associated with each frame. The TIRIs were then segmented into overlapping blocks of size w × w, and the first vertical and first horizontal DCT coefficients (features) were generated from each block using the 2-dimensional discrete cosine transform (2D-DCT) [62, 63]. To enhance the similarity search performance, inverted-file-based and cluster-based similarity search approaches were applied. Devi et al. [64] adopted the same TIRI-DCT [59] method, adding low-pass band coefficients (features) extracted from each block of the TIRIs using the discrete wavelet transform (DWT) [65] to improve the results. The authors in [66] also adopted the same method: the TIRIs were first separated into R, G and B channels and then partitioned into s × s blocks; the color correlation was extracted and the percentage of pixels belonging to a particular group was computed, which was then normalized to obtain a color correlation histogram as the feature vector. Similarly, in [67], key frames were generated by applying the TIRI transform to the preprocessed video to preserve the spatiotemporal information; the method also reduces the feature vector size and the computing time. Local texture descriptors were extracted from each key frame using Weber binarized statistical image features (WBSIF), and a histogram was computed for each key frame; the final feature vector was computed by concatenating k WBSIF histograms. In [68], the authors adopted a similar TIRI transformation of the video sequence, to which the proposed Shearlet-based video fingerprint (SBVF) method was applied to generate fingerprints that preserve both spatial and temporal properties. The SBVF was built from the Shearlet coefficients at Scale-1 (the lowest coarse scale) to reveal the spatial features and at Scale-2 (the second lowest coarse scale) to reveal the directional features. An inverted index file (IIF) hash searching approach was used for comparison and performance evaluation. However, the TIRI cannot represent the video information effectively, because these methods do not take scene changes into account; moreover, the overlapping blocks generate a huge number of TIRIs, which leads to a large amount of redundant information.
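A TIRI-DCT sketch following Eq. (12) and the block feature step is shown below. The exponential frame weights, block size, step and median binarization are assumed values chosen for illustration, not necessarily those of [59,60,61].

```python
# TIRI (Eq. (12)) followed by per-block 2D-DCT features; gamma, w, step and
# the median binarization are assumptions.
import numpy as np
from scipy.fft import dctn

def tiri(frames, gamma=0.65):                     # frames: (K, H, W) array
    k = frames.shape[0]
    alpha = gamma ** np.arange(1, k + 1)           # assumed weight of each frame
    return np.tensordot(alpha, frames.astype(np.float64), axes=1)   # Eq. (12)

def tiri_dct_features(frames, w=16, step=8):
    img = tiri(frames)
    feats = []
    for y in range(0, img.shape[0] - w + 1, step):         # overlapping blocks
        for x in range(0, img.shape[1] - w + 1, step):
            c = dctn(img[y:y + w, x:x + w], norm='ortho')
            feats += [c[0, 1], c[1, 0]]            # first horizontal / vertical AC
    feats = np.array(feats)
    return (feats > np.median(feats)).astype(np.uint8)      # binary fingerprint
```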

The concept of generating a TIRI [59] or a representative saliency map (RSM) [69,70,71] for spatial–temporal video copy detection was replaced in [73] by Liu et al. with the generation of a temporally representative frame (TRF) [72] using a temporally visual weighting (TVW) method based on visual attention [43]; the resulting compact hash value provides better performance and was further improved by the authors in [74]. In the latter work, the visual appearance and visual attention features are fused using a deep belief network (DBN) to obtain a compact hash value that represents the whole video. The visual appearance feature is extracted directly from each block of the TRFs, while the visual attention feature is extracted from each block of the RSMs of the video; a Gaussian mixture model (GMM) is used to derive the dynamic attention model, whereas a static attention model based on intensity, texture and color features is used to create the saliency map. The TRF of a video segment is generated as given below:

$$ F\left( {x,y} \right) = \sum\limits_{i = 1}^{K} {\omega_{i} } \cdot F\left( {x,y,i} \right), $$
(13)

where F(x, y, i) is the intensity of the \( \left( {x,y} \right) \)th pixel of the ith frame of a video segment with K frames, ωi is the temporally visual weight computed from the strength of the visual attention shift, and F(x, y) is the intensity of the resulting TRF. The RSM is generated in the same way as the TRF:

$$ {\text{RSM}}\left( {u,v} \right) = \sum\limits_{j = 1}^{W} {\alpha_{j} } S\left( {u,v,j} \right), $$
(14)

where S(u, v, j) is the luminance value of the \( (u,v) \)th pixel of the jth saliency map of the video segment, αj is the temporally visual weight, and RSM(u, v) is the luminance value of the corresponding pixel of the RSM. However, frequent frame insertion and cropping of large regions will affect the method.

In [75, 76], a 3D-DWT-based method was proposed to overcome the limitations and inefficiency of the 2-dimensional discrete wavelet transform (2D-DWT) [65, 77], since a video is inherently a 3-dimensional signal. In this method, a hash of a group of frames is computed from the spatial–temporal low-pass (LLL) band obtained by applying the 3D-DWT to a video segment; this band serves as the spatiotemporally informative images (STIRIs) of the segment, since the transform inherently involves weighted temporal averaging. The STIRI is partitioned into overlapping blocks of size b × b, and the blocks are shuffled using a secret key k to derive a frame f. The DCT [62] is applied to the overlapping blocks of the STIRI to decorrelate the correlated wavelet coefficients, and the hash is then computed from the DCT coefficients. However, this method is not robust against geometric manipulations such as rotation.
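The LLL extraction step can be sketched with PyWavelets; the Haar wavelet is an assumed choice, and the block shuffling, keying and DCT hashing described above are omitted (they could reuse the TIRI-DCT block step sketched earlier).

```python
# STIRI sketch under stated assumptions: a Haar 3-D DWT of the clip gives
# the spatio-temporal low-pass (LLL) band; the keyed shuffling and DCT
# hashing steps described above are not shown.
import numpy as np
import pywt

def stiri(frames):                                 # frames: (K, H, W) float array
    lll = pywt.dwtn(frames.astype(np.float64), 'haar')['aaa']   # LLL sub-band
    return lll                                      # roughly (K/2, H/2, W/2) images
```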

Since interest points can represent the salient content of a video sequence, the methods in [30, 78, 79] use not only spatial interest points [24] but also temporal interest points along the time axis to achieve higher robustness against content-preserving as well as geometric attacks. These spatial–temporal interest points correspond to points where the image values show significant local variation in both space and time. An improved version of the Harris interest point detector [80] was used to extract the interest points, and a differential description of the local region around each interest point was created; points with significant corresponding eigenvalues were considered salient. However, this method incurs a high computational cost, and the synchronization between two sets of salient points can easily be broken by geometric attacks, since some points may be replaced by new ones. Chen and Chiu [81] used the same method as [30], with the only difference that the spatial–temporal interest points were detected in the visual attention region [73]; to remove noisy feature points, a geometric constraint measurement was employed for bidirectional point matching. Similarly, the authors in [82,83,84] used spatial–temporal interest points [30] for detecting local interest points of regions. In [82, 83], the Kanade–Lucas–Tomasi (KLT) [85] feature tracker was used to track the Harris points and obtain stable local feature point trajectories. In [84], local fingerprints were extracted with the contrast context histogram (CCH) from local regions around each interest point by evaluating the intensity differences between the center pixel and the other pixels. These methods suffer from high dimensionality and computational complexity.

In video copy detection, the computational complexity arising from the high dimensionality of the hash or feature vector strongly affects the performance of a method. To address this issue, Nie et al. [86] introduced a high-order tensor model-based projection technique that exploits the assistance and consensus among different features; the video tensor is then decomposed via the Tucker model. This method outperforms the projection-based video hashing approaches in [87,88,89,90]. A comprehensive feature is computed from the low-order tensor obtained from the tensor decomposition, and the video hash is finally generated from this feature. Tensor-based projections can provide good robustness while effectively capturing the spatiotemporal essence of the video for discriminability [87]. However, random frame insertion and large illumination changes can degrade the robustness of this method.

The spatial ordinal measure [37] has been extended to the temporal domain [91, 92] by ranking the blocks along the temporal axis to generate robust fingerprints for accurate matching between original and pirated videos. This method cannot handle certain transformations such as frequent region cropping, frame insertion, etc. Lee et al. [93] introduced a video copy detection method based on a combination of the histogram of oriented gradients (HOG) descriptor and the ordinal measure [92] representation of the frame. The HOG descriptor was used for object detection and for describing the global features of the frames in the video sequence. An ordinal measure histogram (OH) was used to generate the feature vector of the entire video sequence as a temporal feature, which is robust against color shifting and size variations; there is, however, a trade-off between robustness and discriminability. In [94], the proposed method extracts a spatiotemporal compact feature STk from the key frames of a video, which are selected by abrupt luminance changes, as follows:

$$ \begin{aligned} {\text{ST}}_{k} &= \left\{ {\Delta_{q} \left( {k,1} \right),\Delta_{q} \left( {k,2} \right), \ldots ,\Delta_{q} \left( {k,9} \right),D_{k} } \right\}, \\ &\quad {\text{for}}\quad k = 1, \ldots ,K,\end{aligned} $$
(15)

where Dk is the temporal interval between the current key frame Fk and the previously selected key frame Fk−1, and Δq(k, m) (m = 1, …, 9) denotes the luminance differences of the 9 blocks of key frame Fk. The complexity of this method lies in the selection of a robust key frame.

The problem of efficiently searching for highly deformed videos in small datasets also affects the performance of video copy detection systems. To address this problem, Douze et al. [95] introduced a spatiotemporal post-filtering scheme in which the matched frames are grouped into sequences, and matches that are not consistent, in terms of scaling and rotation, with the dominant hypothesis for a database image are discarded using a weak geometry consistency (WGC) strategy. In this model, the temporal shift is first determined by a 1-dimensional Hough voting strategy, and the spatial component is then determined by estimating a 2-dimensional affine transformation between the matching video sequences. The local patches or salient interest points are first detected with a Hessian affine region detector (HARD), and the pattern of the surrounding local regions is described by SIFT [19] or center-symmetric local binary pattern (CS-LBP) descriptors. The descriptors are then clustered to form a bag of features, and the matched frames are computed based on the Hamming embedding method. This method does not pay attention to frequent frame deletion or region cropping.

The authors in [96] proposed a method that fuses the spatial and temporal information of a video sequence. The spatial fingerprint is extracted using the TIRI-DCT method [61], and the temporal fingerprint is extracted from the temporal variances (differences) V. The temporal strength TS of V is then extracted and used to adaptively weight the importance of the temporal fingerprints at the modality fusion stage. This method overcomes the limitation of previous methods that used only pre-specified weights for combining spatial and temporal information. One of the main issues with this method is that when the gap between the temporal strengths of the compared temporal fingerprints is large, the temporal fingerprints become easy to distinguish from each other. Similarly, in [97], three spatiotemporal parameters, i.e., color space, frame partitioning and sampling frame rate, were evaluated for video copy detection based on normalized average luminance descriptors. This method is limited to content-preserving distortions and is not robust against geometric distortions such as frame deletion, rotation, scaling, etc. Moreover, reducing the sampling frame rate and increasing the number of frame partitions lowers both the efficiency and the performance of the method. Several other methods, based on video tomography and bag-of-visual-words [98], histogram of oriented gradients (HOG) and compression properties [99], shot-based semantic concepts identified along the temporal axis [100] and the self-similarity matrix (SSM) [101], have also been proposed; they all exploit both the spatial and temporal information in a video clip or sequence to yield better performance for robust video copy or forgery detection.

2.4 Other methods

2.4.1 Learning-based approaches

Ye et al. [102] introduced a new learning-based hashing approach, called structure learning, for indexing large-scale multimedia data. The idea of this approach is to leverage data properties and human supervision on known training datasets to derive compact and accurate hash codes. The method is based on supervised learning, in which the structure information exploits both the discriminative local visual patterns occurring in video frames associated with the same semantic class and the temporal consistency over successive frames. This idea was further improved by Chen et al. [103], who developed a multilayer neural network to learn discriminative and compact hash codes. Their methodology exploits both the nonlinear relationship between video samples and the structure information between distinct frames within a video; the intra-video similarity was also taken into consideration. To further improve the performance, a subspace clustering method was employed to cluster the frames into distinct scenes. Motion information such as optical flow is not considered in this method, which can degrade the performance.

2.4.2 Deep neural network-based approaches

Another learning-based hashing method, called deep video hashing (DVH), was proposed by the authors in [104]; it learns binary codes for the entire video in a deep network to exploit both the discriminative and the temporal information of videos. The method was designed for scalable video search in large multimedia databases and is based on a convolutional neural network (CNN) learning framework. Since the method uses supervised information in a deep learning network, ambiguity in the label information can degrade the performance. Hao et al. [105] proposed an unsupervised hashing extension of stochastic multi-view hashing (SMVH) [106] based on a Student t-distribution matching scheme, called t-USMVH, together with its deep-hashing extension using a neural network, called t-UDH. The aim of these methods is likewise to improve scalable search performance in large video databases. Hu and Lu [107] introduced a deep learning-based method in which a CNN [104] and a recurrent neural network (RNN) are used jointly to achieve better copy detection accuracy; this method overcomes the limitations of the methods proposed in [108, 109]. Here, the features or fingerprints are initially extracted from the video frames using a residual convolutional neural network (ResNet), a Siamese Long Short-Term Memory (Siamese LSTM) architecture is trained to fuse the spatial and temporal properties and perform sequence matching, and a graph-based neural network is finally used to identify the copied segments of a video. However, this method can incur a high computational cost because of the large number of training samples, and its robustness against geometric and content-preserving attacks has not been analyzed thoroughly.

Li et al. [110] proposed a parallel 3-dimensional convolutional neural network (3D-CNN) approach for video classification that relaxes the 3D-CNN from a multi-class to a two-class classification task in order to reduce the training data requirement. Features are extracted directly from the input video streams using the 3D-CNN to obtain local motion information. The parallel 3D-CNN classification model is built from a number of 3D-CNNs; as each 3D-CNN is a two-class video classifier, the number of 3D-CNNs equals the number of video classes, and the final decision is obtained by concatenating the classification results of all 3D-CNN classifiers. However, this method can incur a high computational cost, which needs to be reduced, and its robustness against different distortions has not been analyzed thoroughly. The authors in [111] introduced a data-driven approach that uses a deep neural network to learn robust video fingerprints or descriptors from raw video. The task of learning the video descriptor was broken down into subproblems, and a neural network was trained to tackle each of them. A conditional restricted Boltzmann machine (CRBM) was used as one of the prominent components for building the deep feature learning network (a conditional generative model) and was trained to capture the intrinsic visual characteristics as well as the spatiotemporal correlations among the visual contents of a video, represented as an intermediate descriptor. A nonlinear encoder, a denoising auto-encoder, was then trained on pairs of intermediate descriptors extracted from manipulated and original videos to learn a compressed yet robust representation of the intermediate descriptor. To preserve the optimal balance between robustness and discriminative capability of the output descriptor, the top layers of the network were trained accordingly. However, the challenge with this method lies in the computational cost, as the training dataset grows with the size of the network. Nie et al. [112] combined handcrafted visual features and semantic features of videos for near-duplicate video detection. First, a low-level representation fingerprint (LRF) is generated from handcrafted visual features using a tensor-based approach that preserves the mutual relations among the various visual features. Second, a CNN [104] is used to learn deep semantic features and generate a deep representation fingerprint (DRF), providing heterogeneous assistance to the LRF. This approach also incurs a high computational cost, which must be taken into consideration for better performance.

2.4.3 Miscellaneous approaches

Singh and Aggarwal [113] introduced a method to detect upscale-crop (frame-level) and splicing (region-level) forgeries performed in digital videos with the image processing operation called resampling. The detection of resampling artifacts (compression and noise) is carried out using pixel-covariance correlation and noise-inconsistency analysis, whose outcomes are combined to give better performance. A modified Gallagher (MG) detector and a fractional modified Gallagher (F-MG) detector were used for the pixel-covariance correlation analysis; the MG detector operates in the fast Fourier transform (FFT) and discrete cosine transform (DCT) domains, while the F-MG detector operates in the discrete fractional Fourier transform (DFrFT) domain. Similarly, the noise-inconsistency analysis is based on a wavelet denoising filter. For splicing (region-level) forgery detection, the methods were applied to regions of interest (ROIs) in the video frames. The main challenge of this method lies in estimating the parameters used for the analysis, such as the scaling factors and the interpolation filter. The same authors later proposed a method for the detection and localization of copy–paste forgeries [114], which alter the content of a particular region of a frame in digital videos. Sensor pattern noise (SPN), Hausdorff distance-based clustering and color filter array (CFA) methods were used for copy–paste forgery detection and localization. As this approach performs frame-to-frame and region-based matching, it can incur high computational complexity.

A multimodal visual–audio fingerprint-based video copy detection approach was proposed by Roopalakshmi et al. [115], in which visual and audio features are combined to detect illegal copies. First, a 1-D motion feature vector is generated by averaging the differences between the region-wise motion vector magnitudes of consecutive frames, and a 1-D acoustic feature is generated using mel-frequency cepstral coefficients (MFCCs). In this approach, the DCT [62] is applied to the log powers of the mel-frequency spectrum (the short-term power spectrum of the audio), and the resulting DCT coefficients form the MFCCs. Second, a sliding-window-based dynamic programming approach is applied to achieve accurate frame-to-frame matching. The two features are then combined into a 1-D feature vector for copy detection. The performance of the proposed method is better than that of the reference methods [116,117,118]. However, the computational complexity can degrade the performance of this method.

A key-parameter-dependent heat kernel signature (HKS)-based 3D model hashing was proposed by Lee et al. in [119, 120]. This methodology was mainly developed for video authentication and is robust against isometric modifications. The local and global HKS coefficients are obtained over time scales by computing the eigenvalues and eigenvectors of a mesh Laplace operator. These HKS coefficients are then clustered into 2D square cells with variable bin sizes, and the feature values are extracted from the weighted distances of the HKS coefficients based on an nth-order Butterworth function. The binary hash is generated by binarizing the intermediate hash values obtained by projecting the feature values onto random values. Further, to improve the robustness, uniqueness, security and spaciousness, two parameters, the bin-center points and the cell amplitudes, are used. Choosing a robust key and suitable parameters is the main challenge of this methodology. Many other methods [121,122,123,124] have been introduced that explore the significance of visual hashing or fingerprinting in the field of video copy detection.

3 Major challenges

Hashing- or fingerprinting-based copy detection is preferable to watermarking-based copy detection for illegal video copy detection, as multimedia content is often transformed or manipulated before being uploaded to video sharing Web sites [1]. Still, some challenges remain with the existing methods. Based on the works found in the state of the art, the main challenges that should be taken into consideration in order to enhance the performance of a copy detection system are as follows:

  • To acquire a better trade-off between discriminability and robustness against geometric as well as content-preserving distortions.

  • To lessen the computational cost of fingerprint extraction and matching.

  • To enhance the efficiency of fingerprint database search.

  • To lessen the storage space requirement for each fingerprint.

  • To incorporate semantic concepts in order to lessen the semantic gap between high-level and low-level feature representations of frame images.

  • To integrate fingerprinting-based and watermarking-based copy detection techniques in order to yield content identification as well as user authentication for high security.

Most video copy detection approaches are robust against common content-preserving distortions such as contrast enhancement, blurring, frame rate change and frame resizing, but robustness against geometric distortions such as rotation, scaling, frame dropping, flipping and picture-in-picture still poses specific challenges to fingerprinting- or hashing-based video copy detection. The problem of copyright infringement of original videos by adversaries remains a major issue as multimedia technology has advanced. Many researchers are working on solutions for geometric distortions, and a large number of solutions have been proposed, but the achieved robustness and detection accuracy are still not at their peak and need to be enhanced further.

4 Current trends and discussion

Since the emergence of copyright infringement or piracy issues in multimedia, various approaches have been proposed by several researchers to tackle them. The fingerprinting-based copy detection approach has mostly been adopted because of its discriminability and robustness compared to the watermarking-based copy detection system [15]. Most of the existing methods are robust against content-preserving distortions, so researchers are currently working to achieve high robustness against geometric distortions, which is still a challenging task. Besides robustness, discriminative capability also plays an important role in a video copy detection system. It can be observed from the state of the art that there exists a trade-off between discriminability and robustness, so researchers are also working on acquiring a better trade-off between the two for optimal performance. Some state-of-the-art approaches have used local feature descriptors such as SIFT [19] for discriminability, others have used global feature descriptors such as OM [37] for robustness to common distortions, and some have used a combination of local and global feature descriptors to improve performance. In many of these methods, both local and global features are extracted from the spatial domain only, ignoring the temporal domain, which is also an important property of a video. To overcome this limitation, some approaches have been proposed that exploit both the spatial and temporal domains, such as TIRI-DCT [59]; a simplified sketch of this idea is given after Table 1. Recently, researchers have been focusing on deep neural network-based learning approaches such as CNNs [104] in the field of video copy detection. This approach incurs a high computational cost, as it requires a large amount of storage for the pre-trained models and known datasets, which grows with the network size. Moreover, ambiguity may exist in the label information used by supervised deep learning networks. As this approach has only recently been adopted for video copy detection, there remains considerable scope for analyzing its shortcomings broadly, so that a fast and optimal solution for copy detection can be achieved in the future (Table 1).

Table 1 Summary of the classification of some important existing hashing-based video copy detection methods based on different characteristics
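As an illustration of a spatiotemporal fingerprint in the spirit of TIRI-DCT [59] (simplified; the block size, weighting factor and selected coefficients below are assumptions rather than the exact parameters of [59]), a temporally informative representative image is formed as a weighted average of consecutive frames, block-wise 2D DCTs are taken, and one low-frequency value per block is binarized against the median.

```python
import numpy as np
from scipy.fft import dctn

def tiri_dct_fingerprint(gray_frames, gamma=0.65, block=8):
    """Simplified TIRI-DCT-style fingerprint: (1) weighted temporal average of
    the frames (weights gamma**k), (2) block-wise 2D DCT of the resulting
    representative image, (3) binarize one low-frequency value per block
    against the median of all blocks."""
    frames = np.asarray(gray_frames, dtype=np.float32)        # shape (T, H, W)
    weights = gamma ** np.arange(len(frames))
    tiri = np.tensordot(weights, frames, axes=1) / weights.sum()  # (H, W)

    h, w = tiri.shape
    coeffs = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            d = dctn(tiri[y:y + block, x:x + block], norm='ortho')
            coeffs.append(d[0, 1] + d[1, 0])  # two low-frequency coefficients per block
    coeffs = np.array(coeffs)
    return (coeffs > np.median(coeffs)).astype(np.uint8)      # binary fingerprint
```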

The choice of a better copy detection method depends firmly on what we are looking for and where we are looking for it. No universal description and no single approach seem optimal for the various applications that require video copy detection. Some application cases for finding copies are identified below:

  • Finding exact copies in a stream for statistics on commercials.

  • Finding a transformed full movie with a possible decrease in quality (camcording) and no postproduction.

  • Finding short segments in a TV stream with possibly large postproduction transformations.

  • Finding short videos on the Internet with various transformations.

For the first case, local feature descriptors such as the Harris detector [32] and SIFT [19] work well, as they describe precise interest points suited to detecting exact copies. For finding a transformed full movie, where the length of the video sequence matters, global feature descriptors such as OM [37] are probably more efficient and faster than local feature descriptors. For the third case, finding short segments in a video stream is a critical issue, and the Harris detector [32] will probably give better results than global feature descriptors. For the fourth case, multiple difficulties are mixed for videos on the Internet, and the solution depends on the quality required; a method that combines local and global features while preserving both spatial and temporal properties seems the most promising for handling the various transformations. It can be clearly observed that adversaries apply various distortions such as rotation, scaling, cropping, gamma correction, etc., to the original video to a large extent. The choice of method is still open, but combining handcrafted visual features (local and global features in the spatiotemporal domain) with deep semantic features from deep neural networks, thereby providing both discriminability and robustness, seems the most promising direction for video classification and accurate detection of illegal copies. Many of these methods are already used by video sharing Web sites such as YouTube and Netflix for copy detection; still, these sites face a large number of copyright infringement issues, which need to be analyzed in depth so that robust methods for fast and accurate copy detection can be deployed.

All of the top methods in fingerprinting- or hashing-based video copy detection follow the paradigm of computing compact signatures or hash codes from the content of digital media without altering that content, which is important for various multimedia applications. The generated compact hash or fingerprint can tell whether a dubious piece of content matches a multimedia document registered in the fingerprint database; thus, it can robustly detect content replication of an original video. Unlike the watermarking approach, the fingerprinting approach can be applied to legacy content that has already been distributed, and it is more discriminative as well as more robust against various content transformations than the watermarking approach.
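A minimal sketch of this matching paradigm is given below (illustrative only): binary fingerprints are registered as packed bit arrays, and a query is declared a copy of the closest registered video when the normalized Hamming distance falls below a threshold; the threshold value and the exhaustive search are assumptions for simplicity.

```python
import numpy as np

class FingerprintIndex:
    """Toy fingerprint database: stores binary fingerprints and answers queries
    by exhaustive normalized Hamming distance search."""

    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.ids, self.codes = [], []

    def register(self, video_id, bits):
        """bits: 1-D array of 0/1 values of length n_bits for one registered video."""
        self.ids.append(video_id)
        self.codes.append(np.packbits(np.asarray(bits, dtype=np.uint8)))

    def query(self, bits, threshold=0.25):
        """Return (video_id, distance) of the best match, or None if the smallest
        normalized Hamming distance exceeds the threshold."""
        q = np.packbits(np.asarray(bits, dtype=np.uint8))
        codes = np.stack(self.codes)
        # popcount of the XOR gives the Hamming distance to each registered fingerprint
        dists = np.unpackbits(codes ^ q, axis=1).sum(axis=1) / self.n_bits
        best = int(np.argmin(dists))
        return (self.ids[best], float(dists[best])) if dists[best] <= threshold else None
```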

5 Conclusion

The objective of this paper is to provide a detailed summary of existing visual hashing- or fingerprinting-based video copy detection systems. Most of the existing video copy detection methods are based on the spatial domain, the temporal domain or a combination of both, according to the extracted features, and many other techniques have also been used. Methods based on extracting local features, such as region-of-interest (ROI) points in the spatial domain of a video, have greater discriminating power than other extracted features but are less robust against geometric attacks. Methods considering only the spatial domain are not sufficient to survive temporal attacks such as frame rate change, which affect the motion information along the temporal axis of a video sequence. To solve these issues, many researchers have developed methods that exploit both the spatial and temporal information of a video. Still, most of these methods are not robust against both content-preserving and geometric attacks, as they mainly focus on extracting local features in grayscale form and ignore global features such as color information, which is also an important property when the color frames of a video come into play. There is a trade-off between the discriminability and robustness properties in most of the existing methods. In recent decades, detecting copied or pirated versions of original video content has become more complex as multimedia technology has advanced tremendously. Thus, employing methods that possess both discriminability and robustness against various content-preserving as well as geometric attacks, such as lossy compression, resizing, rotation and scaling, has become the greatest challenge in video copy detection systems.

To tackle the problems and issues faced in video copy detection systems, many researchers are currently working in this field, trying to improve the performance and efficiency of copy detection systems based on robust visual hashing or fingerprinting techniques.