1 Introduction

There is a huge number of digital images on the web and it is common to see multiple versions of the same image. The increasing use of low-cost imaging devices and the availability of large databases of digital photos and movies makes the retrieval of digital media a frequent activity for a number of applications. Creation, display and management of digital photos have been important activities in the digital life. People are accustomed to recording their daily life or journeys by digital cameras and share their living/travel experience on the web. Handling images on the Internet is very easy and convenient for users nowadays. This brings numerous security problems. Social media is one of the main areas where massive propagation of similar visual content takes place. The manipulation of images using computer techniques is not new and gained a lot of popularity and even acceptance in diverse areas such as forensic investigation, information technology, intelligence services, medical imaging, journalism, digital cinema, special effects in movies, etc. As more images are published on the web, modifying images is becoming increasingly easy with the help of powerful and user-friendly image manipulation software, and the move towards paperless workplaces and the introduction of e-Government services demand more data to be stored in digital format and more challenges are to be faced to securing authentic data. Unfortunately, document files, voice data and image data are all vulnerable to manipulation and doctoring. This makes the slightly altered image copies, which are termed near-duplicate (ND) images to their originals, difficult to detect. Proper feature extraction methods should be employed to overcome this challenge. Feature extraction is the most important step in near-duplicate detection. It is most critical because the features extracted directly influence the efficiency in the detection/retrieval.

The presence of near-duplicates (NDs) plays an important role in the performance degradation while integrating information from different sources. Web faces huge problems due to the existence of such NDs. By introducing efficient methods to detect and remove such NDs from the web not only decreases the computation time but also increases the relevancy of search results. The near duplicates must be clustered to avoid viewing the re-occurrence of the same image or variants of it resulting in efficient search engine results. Near-duplicate images arise generally when the same object is captured with the different viewpoint, illumination conditions, and different resolutions with zooming effects. These images are also called similar images because by definition [1], similar images are the images of the same object or scene viewed under different imaging conditions. Figure 1 shows an example of the near-duplicate image taken from the INRIA Copydays dataset [1]. The Fig. 1b is JPEG attacked image of Fig. 1a. Both look visually similar, but there exists difference.

Fig. 1
figure 1

Examples of near-duplicate images  from INRIA Copydays dataset

Near-duplicate images also arise in the real life when a particular object is captured at different instances of time when the background is moving as in panorama and burst shots. The examples taken from California ND data set [2] are shown in Fig. 2.

Fig. 2
figure 2

Examples of near-duplicate images from California ND dataset

Panorama photos capture the images with the horizontally elongated view and it is also known as wide format photography. Burst shots are useful for capturing the perfect moment when the objects in the image are moving.

Figure 3 shows the example of near-duplicate/forged/tampered images taken from the COVERAGE database [3]. In Fig. 3, a) is the original image and b) is the tampered version of a). Both look visually similar, but there exists difference. The portion marked in green color is chosen as the region of interest and the lighting of this region is modified (illumination transformation), translated to another similar object of the same image.

Fig. 3
figure 3

Example of near-duplicate image (visually similar)

Two NDIs may be perceptually identical, in which the differences comprise of noise editing operations, small photometric distortions, change in color balance, change in brightness, compression artifacts, contrast adjustment, rotation, cropping, filtering, scaling etc. Figure 4 shows two images taken from Toys dataset [4] as an example of near duplicates for rotation transformation. Figure 4b) is the rotated image of Fig. 4a).

Fig. 4
figure 4

Near-duplicate image created by rotation

Figure 5 shows the images taken from California ND dataset [2] as an example of near duplicate image captured from the same scene.

Fig. 5
figure 5

Near-duplicate images with viewpoint and illumination change

The presence of these ND images can be detected using Computer Vision techniques. Computer Vision deals with how computers can be made to gain high-level understanding from digital images or videos. It seeks to automate tasks that the human visual system can do. Computer vision tasks include methods for acquiring digital images, image processing and image analysis to reach an understanding of digital images. The problem in Computer Vision, image processing and machine vision is that of determining whether or not the image data contains specific object, feature or activity. In Computer Vision, a feature is a piece of information which is relevant for solving the computational task related to certain application. The feature concept is very general and the choice of features in a particular computer vision system may be highly dependent on the specific problem. Features are used as a starting point for many computer vision algorithm. Since features are used as the starting point and as it is considered as main primitive, the overall algorithm will only be as effective as its feature extraction. So, this survey concentrates mainly on the feature extraction and how it is efficiently used in the detection of NDIs.

1.1 Motivation

Detecting near duplicate images (NDIs) in large databases should meet two challenging constraints—i) For a given image, only a small amount of data can be stored in web to reduce the search complexity (for e.g. fingerprint) and ii) Queries must be very simple to evaluate [5]. Image retrieval from large databases is a fundamental issue for forensic tasks. Forensic tasks include collecting, identifying, classifying and analyzing physical evidence related to criminal investigations. Since the images on the web are retrieved easily, it becomes easy for the web users to manipulate many images as near duplicates. Mainly the images that are used as physical evidence in the criminal investigation are manipulated. The presence of NDs plays an important role in the performance degradation while integrating information from different sources. Web faces huge problems due to the existence of such NDs. This increases the index storage space and thereby increases the serving cost. By introducing efficient methods to detect and remove such NDs from the web search not only decreases the computation time but also increases the relevancy of search results. Some of the applications of near-duplicate detection (NDD) include integrating TV news telecasted on different days, integrating information from various letter sorting machines for postal automation, etc.

Landge Amruta and Pranoti Mane reviewed the development of several image matching algorithms [6]. Several mathematical formulae are discussed based on which the similarity between two images is computed. But, the review does not concentrate on the image matching algorithm that best suits the near-duplicate identification. The state-of-art review paper in copy-move forgery detection concentrates only on one part of the near-duplicate image detection i.e. detecting forged images using image editing software [7]. As forged images are the subset of ND images and they concentrate on detecting the type of forgery made, the datasets requirements to evaluate these methods will be slightly different. Evaggelos Spyrou and Phivos Mylonas made a survey on the Flickr social network highlighting the current progress, innovations, and the evaluation methods [8]. If any two images look similar or convey similar concepts to you they are near-duplicates. So, near-duplicate detection is a highly subjective process. An algorithm which is producing good results for the ND images created using editing software need not produce the same result for the ND images captured using a camera in real life. For example, an artificial zooming effect created using software will not be exactly same as that of the effect produced by zooming lens of a camera and it will not be accurate because while zooming with the camera, there may be slight changes in the scene. So to fill with this gap any algorithm developed for NDD should also be evaluated using datasets created from the real-life scenario. NDD is a broader area of research.

The remainder of this paper is organized as follows. Section 2 reviews the general methods used for the detection of near duplicates. The feature extraction methods for NDD images are reviewed in Sect. 3. In Sect. 4, the methods used in ND retrieval of images are presented. NDD used in image forensics are mentioned in Sect. 5. Section 6 discusses the clustering methods used for the NDD. Some of the databases used for the evaluation of NDD are explained in Sect. 7 and the applications of NDD images are listed in Sect. 8. The Sect. 9 points to the challenges and future research directions and finally, Sect. 10 concludes this paper.

2 General Method Used for the Detection of Near Duplicates

The steps followed in the detection of near-duplicate images are generally quite similar to that of traditional image matching method. Figure 6 shows the general block diagram for the identification of near duplicate images. Initially, the features of the images (image descriptors), retrieved from the web or in the database, are extracted and stored in another database named ‘Feature Vector Database’. The features of the query image are extracted and compared with the feature vectors of all images already stored in the database. If matches are found, they are marked as near-duplicate/visually similar images. This approach requires an efficient indexing structure to search for similar descriptors in the database. The percentage of similarity required is fixed by a reference value. Some of the methods used for the detection of near-duplicate images are presented in the next section.

Fig. 6
figure 6

General block diagram for identifying near-duplicate images

3 Feature Extraction Methods for NDD Images

Generally features extraction methods can be broadly classified as low-level and high-level features or local and global features. In NDD it can be classified as point-based feature extraction, pixel-based feature extraction and area-based feature extraction methods.

If the features are extracted based on the keypoints or interest point, then those methods are classified as point-based feature extraction methods. If the features are calculated at each and every pixel, then it is referred as pixel-based feature extraction method. Area-based feature extraction methods are the methods in which the features are calculated over the entire image or just regular sub-area of an image. This classification is shown in Fig. 7.

Fig. 7
figure 7

Classification of feature extraction methods in NDD images

3.1 Keypoint-Based Feature Extraction Methods

The state of art near-duplicate image/video detection methods requires proper selection of image descriptors (features) to represent the images precisely. For finding copyright violations and detecting forged images, Ke et al. presented an efficient near- duplicate detection method by using robust interest point detector (Difference of Gaussian, DoG) and distinctive local descriptors (Principal Component Analysis – Scale Invariant Feature Transform, PCA-SIFT) [9]. They used locality sensitive hashing (LSH) to index these local descriptors. But, the system matches the similar images of the same scene even if they are neither near duplicates nor forged images. Chen Li and Fred Stentiford presented an attention-based similarity measure which is generated based on a trial and error basis [10]. The SIFT interest points are very effective to geometric variations in the image but scalability is a big issue due to a large number of features generated. To overcome this issue, Foo and Sinha pruned SIFT interest points to reduce the memory and query run-time with negligible loss in effectiveness [11]. To reduce the number of SIFT interest keypoints, the threshold is varied to discard candidate local peaks. They also pruned PCA-SIFT features. A modified Redundant Bit Vector (RBV) index is introduced by Foo and Sinha for a high-dimensionality search of multimedia data and is efficient for audio fingerprint detection [12]. RBV gained efficiency and a reduction in memory requirement in comparison to LSH.

SURF (Speeded Up Robust Features) is a scale and rotation-invariant detector [13]. It is used for classification tasks and is faster to compute. It is suited for object detection, object recognition or image retrieval. Hessian matrix approximation is used for interest point detection. SURF outperforms SIFT. This is due to the fact that SURF integrates the gradient information within a sub patch, whereas SIFT depends on the orientation of the individual gradients. This makes SURF less sensitive to noise. Another feature ORB (Oriented FAST (Features from Accelerated Segment Test) and Rotated BRIEF (Binary Robust Independent Elementary Features)) is a rotation invariant and resistant noise descriptor [14]. ORB is a combination of oFAST and rBRIEF. BRIEF is a feature descriptor that uses simple binary tests between pixels in a smoothed image patch. Its performance is similar to SIFT in many respects, including robustness to lighting, blur and perspective distortion. However, it is very sensitive to in-planar rotation. FAST features are widely used because of their computational properties. However, FAST features do not have an orientation component. The main drawback in ORB is that scale invariance is not addressed.

SIFT not only has good scale and brightness invariance but also has a certain robustness to affine distortion, perspective change, and additive noise. However, to extract SIFT features for representing an image, hundreds or even thousands of SIFT key points need to be selected. Each key point uses a 128-dimensional feature vector to explain the features. Thus, the matching value of detection methodology supported SIFT options is high. Cao et al. experimented with ASIFT to detect near duplicates [15]. It was found that it showed better performance than Hessian-Affine and MSER (Maximally Stable Extremal Region). ASIFT (Affine Scale Invariant Feature Transform) manages effectively all the parameters of the affine transform [16]. Lindeberg presented a theoretical basis for computing scale invariant image features and image descriptors for a wide range of possible applications in computer vision [17]. Wang et al. presented a keypoint based approach for near duplicate image detection [18]. This method works with fewer keypoints matching, i.e., it can detect reliable and salient keypoints and these keypoints are matched with more strong constraints. The false alarm is reduced as more specific details are not considered to enhance the color matching.

Tareen and Saleem analysed SIFT, SURF, KAZE, AKAZE, ORB and BRISK features [19]. They concluded the following - SIFT is invariant to image rotation, scale and limited affine variations; its main drawback is high computational cost. SURF features are invariant to rotation and scale but they have little affine invariance. The main advantage of SURF over SIFT is its computational cost. KAZE features are invariant to rotation, scale, limited affine and have more distinctiveness at varying scales with the cost of moderate increase in computational time. AKAZE Accelerated KAZE. AKAZE features are invariant to scale, rotation, limited affine and have more distinctiveness at varying scales because of non-linear scale spaces. ORB features are invariant to scale, rotation and limited affine changes. BRISK (Binary Robust Invariant Scalable Keypoints) features are invariant to scale, rotation and limited affine changes. The overall accuracy of SIFT and BRISK is found to be highest for all types of geometric transformations and SIFT is concluded as the most accurate algorithm. BRISK is at second position with respect to the accuracy for scale and rotation changes. ORB is less accurate than BRISK for scale and rotation changes but both are comparable for affine changes. AKAZE is more accurate than KAZE for scale and affine variations. However for rotation changes KAZE is more accurate than AKAZE. The advantages, disadvantages and suitability of point-based features are tabulated in Table 1.

Table 1 Summary of features in point-based feature extraction methods

3.2 Pixel-Based Feature Extraction Methods

Foo et al. presented detection of ND images in the web search [20]. The color and texture features are extracted using Dynamic Partial Functions (DPF) and PCA-SIFT local descriptors. They used hash based probabilistic counting (HPC) for automatic detection of ND images. With the fast development of RISC (Reduced Instruction Set Computer) processor, camera, display technology and wireless networks, the mobile devices have become more powerful, ubiquitous and important to users. A large-scale partial-duplicate image search system for mobile platform was developed in [21]. Compared with traditional methods, image search on mobile platform needs to take more factors into consideration, e.g., the limited computational resource, storage and memory capacity, relatively expensive wireless data transmission, rich sensors like GPS, accelerometer, gyroscope etc. To achieve efficient and accurate partial-duplicate mobile search, they focussed on extracting compact, discriminative, and efficient local descriptor which is commonly known as a basis for Bag-of-visual words (BOWs) representation. Edge-SIFT is partially based on SIFT but is more discriminative, compact and suitable for mobile partial-duplicate image search. Edge-SIFT recorded the locations and orientations of edge pixels, thus preserved rich spatial clues and imposed more strict restrictions on feature matching.

Pulse Coupled Neural Network (PCNN) is used in the ND image detection. It is used for feature extraction process as it is simple and extracts reduced number of features. Many researchers work on PCNN for its easy implementation, higher recognition, anti-noise disturbance, robustness, etc. PCNN is unsupervised and self-organized network. The PCNN and its numerous variations have been found to be useful in a wide variety of applications. The mathematical model using discrete Fourier transform on the global pulse signal of the PCNN is described by Raul C. Muresan [22] and analysed the pulse of the network to achieve scale and translation independent recognition of an image. The system is used for recognising simple geometric shapes and letters. Yu and Le presented a novel feature extraction method for image processing via PCNN with Tsallis entropy [23]. This feature is translation and scale independent, while rotation is a bit weak at diagonal angles of 45º and 135º caused by pixel discretization. It is a new and potential approach and can be applied to all sorts of recognition fields. The drawbacks encountered are low classification rate and need to simplify the procedures in this model.

Yide Ma et al. extracted image features as time signals of PCNN [24]. These signals are invariant to larger changes in rotation, scale and shift or skew of the input image. They compute Mean Square Error (MSE) between the feature vectors to decide the kind of image group that the input image belongs. This method is strongly flexible to resist noises and greatly robust to recognise the image. Feature extraction using Unit-linking PCNN was proposed to generate the global and the local image icons as image features [25]. He included not only the intensities or color information but also the geometry structures or color distribution of the images. The global unit-linking PCNN icon is translation, rotation and scale invariant. Local unit-linking PCNN image icon can authenticate images correctly. For a complicated object detection system, the global unit-linking image icon method should work together with other methods to improve the detection performance. Radoslav Forgac and Igor Mokris improved the feature generation using an optimised PCNN [26]. Their main aim was to reduce the number of generated features and to determine the optimum number of iterations. This reduced feature is considered as a feature with maximal information value for image recognition process. Trong-Thuc et al. presented Ram-based and pipeline-based hardware architectures for PCNN for real-time image feature extraction [27]. The advantages of these models are anti-noise and invariant against geometrical changes. Mona Mahrous Mohammed et al. proposed an optimized PCNN that extracts the visual features of the images as signatures and uses those features for classification [28]. The key feature of using image signature is its simple and small representation. This signature is represented as a vector that contains the number of firing neurons over several iterations. An et al. proposed a graph matching with geometric constraints for near duplicate image retrieval, which explored the spatial relationship in image patches [29]. Also, introduced geometric constraints to remove the false matches and yielded good retrieval results. It has complex run-time and the index structure is not efficient. Along with PCNN, CNN (Convolutional Neural Network) is also used in the near duplicate detection. Zhang et al. presented an image pair-wise similarity joining learning function without any handcrafted features [30]. Convolutional neural network is used to encode the features and classification. This method processes the raw images directly without human designed feature extraction and the images are classified. Compared to conventional approaches, this method eliminated the complex handcrafted features extraction and processed images jointly.

Vonikakis et al. presented non-linear version of the DoG detector (nLDoG-1) [31]. The nLDoG-1 exhibited increased sensitivity to low contrast and provided detection robustness in low local contrast occasion. The nLDoG-1 performed better than the other existing detectors in terms of repeatability score and number of corresponding regions in sequences with either uniform or non-uniform illumination. It is computationally inexpensive and required fewer memory resources. The detector is suitable for mobile robotics applications and for outdoor or space scenarios, where severe illumination changes occur. This detector is applicable only for uniform illumination changes and not for non-uniform illumination changes. Also, it cannot be used in realistic lighting condition. The same authors also presented an illumination invariant operator by combining the non-linear characteristics of biological center-surround cells with the classic difference of Gaussians operator [32]. The advantage is keypoints are detected in both dark and bright image regions. It is appropriate for outdoor vision systems working in environments under uncontrolled illumination conditions. Zhuang et al. proposed two methods Intensity Difference Quantization and Weakly Spatial Context Coding to improve the discriminative power and robustness of binary descriptors [33].

The advantages, disadvantages and suitability of pixel-based features are tabulated in Table 2.

Table 2 Summary of Features in Pixel-based Feature Extraction Method

3.3 Area-Based Feature Extraction Methods

Wu et al. presented query oriented subspace shifting algorithm [34]. The algorithm measures the similarity in various subspaces that are dynamically generated based on the correlation between samples and the query image. The near duplicates are never missed while detecting as these subspaces are query oriented. But, their method is not efficient for large-scale NDD. Chum et al. proposed a near-duplicate image detection method using the min-Hash algorithm [35]. The min Hash method stored only a small constant amount of data per page and a complexity for duplicate enumeration that is close to linear in the number of duplicates returned. A set of weighted visual words are extracted and each visual word is assigned a weight. The weighted histogram intersection is the best similarity measure in both retrieval quality and searches efficiency. The main drawback is some relevant information is not presented in the set of visual words representation. Chun et al. utilized contrast context histogram for image matching and recognition [36]. Their experiments demonstrated that contrast-based local descriptors represented local regions with more compact histogram bins. Their method is used in many real-time applications as it has high matching accuracy and efficient computation.

Herve Jegou et al. proposed to use Hamming embedding (HE) and weak geometric consistency constraints (WGC) descriptors for image matching and searching in large image datasets [1]. The HE provides binary signatures to refine image matching and WGC filters the descriptors that are invariant to rotation and scaling transformations. Pattern Entropy (PE) is a measure to evaluate the information of being a near duplicate pair. The PE captures the matching patterns with two histograms of matching orientations. PE is excellent for characterizing regions with parallel-like matching lines, but not for regions undergone considerable scale and rotation changes. For a more robust pattern evaluation, scale rotation pattern entropy (SR-PE) is introduced by Zhao and NGO [37]. SR-PE gets a lower entropy value indicating the perfect match only when the ND region pair shows homogeneity across channels. SR-PE is computationally efficient. Compared to PE, SR-PE is capable of evaluating complex patterns composed of ND regions under the unknown scale and rotation changes. SR-PE is not effective when the viewpoint changes and it cannot be used to infer higher-level semantics by localizing object and background duplicates. Battiato et al. exploited the coherence between feature spaces in the image representation and in the codebooks generation to improve the bags of visual phrases [38]. This is achieved through the alignment of the feature space partitions obtained from independent clustering.

Sluzek et al. described the detection and segmentation of ND fragments in random images [39]. Random images are the images with unpredictable content. Such fragments usually represent the identical object, though captured from a different viewpoint, under different photometric conditions and/or by a different camera. Affine transformations poorly model ND fragments for objects with strongly non-planar surfaces or objects which are naturally deformable. For the retrieval of such near duplicates, an alternative algorithm is proposed by Sluzek based on the point-based topological constraints. Topological constraints provide high performances in matching non-linear ND fragments at rather low computational costs. Cho et al. proposed a concentric circle-based image signature for NDD [40]. The image is partitioned by radius and angle levels from the center of the image. The feature values are calculated using the average or variation between the partitioned sub-regions and are formed into an image signature by hash generation. The hashing facilitates storage space reduction and quick matching. The hashed bits are more stable and robust to image modifications. But encountered difficulty in geometrical modifications such as noise addition and blurring which change the center.

The traditional near-duplicate image search systems [41,42,43,44,45] mostly were built on the bag-of-local features (BOF) representation as it is favourable for simplicity and scalability. But it has certain shortcomings – high time complexity of the local feature detection, discriminability reduction of local descriptors due to BOF quantization and neglects the geometric relationships among local features after BOF representation. To overcome these shortcomings, Xie et al. proposed a framework by using graphics processing units (GPU) [46]. The computational capability of GPU is much higher than that of a central processing unit. Due to its powerful capability, the GPU nowadays serves not only for graphics display, but also for general purpose computation such as molecular dynamics, image processing, and machine learning and computer vision. Image matching is one of the key technologies for many visual-based applications including template matching, block motion estimation, video compression, stereo vision, image/video NDD, the similarity join for image/video database and so on. Normalized cross correlation (NCC) is one of the widely used methods for image matching with preferable characteristics such as robustness to intensity offsets and contrast changes, but it is computationally expensive. Satoh proposed features, derived by the method of Lagrange multipliers [47]. By using these features, NCC-based image matching is effectively accelerated. It applies image NDD to 2 billion images in the web for image auto-annotation.

Das et al. proposed the term document weight matrix approach with three phases such as rendering, filtering and verification for finding near duplicates of an input web page from a huge repository [48]. This approach explores the semantic structure, content and context of a web page rather than the content only approach. Accuracy is improved further by applying this method after identifying the main content blocks rather than the entire web page. Dong et al. achieved efficiency and quality in [49] by making three contributions – i) the non-discriminative features are filtered out using entropy-based filtering method and with the retained high-quality features, a single match is sufficient for claiming an ND relationship between images with high confidence. This reduced the number of SIFT features to be indexed, ii) A query expansion method based on graph cut is used to reduce the false positive rate by improving recall, iii) This system is capable of indexing more than 50 million web images on a commodity server and it can return search results in less than two seconds.

Bueno et al. proposed the Bayesian approach for near-duplicate image detection and investigated the performance in different probabilistic models [50]. The task of identifying an image whose metadata are missing is applied in different applications such as metadata retrieval in cultural institutions, detection of copyright violations, investigation of latent cross-links in archives and libraries, duplicate elimination in storage management, etc. The increasing interest in archiving all of humankind’s cultural artifacts has resulted within the digitization of lots of books. Most of the data found in historical manuscripts are mainly text, but with a few number of images. Rakthanmanon et al. introduced a scalable system that can detect approximately repeated occurrences of shape patterns both within and between historical texts [51]. This ability to find repeated shapes allowed automatic annotation of manuscripts and allowed users to trace the evolution of ideas. Li et al. proposed a scheme by combining the neighbourhood information of single local feature and the global geometric consistency of multi-local features for improving the accuracy of BOW model [52]. First, the geometric contextual information of image local features is constructed to enhance the distinctiveness of visual word. Then, the global geometric consistency of subset-of-features is verified for improving the accuracy of retrieval results. Li and Feng proposed an approach to detect near-duplicate images automatically based on visual word model [53]. SIFT descriptors are utilized to represent image visual content to detect local features of images. NDI detecting process is implemented by histogram distance computing.

In the traditional methods, the Bag-of-Visual Words (BOVW) model and the inverted index structure are used for image matching. Despite the simplicity, efficiency and scalability, these algorithms highly depend on the accurate matching of local features. This suffered from unsatisfied precision and recall. Xie et al. investigated the re-ranking problem from a graph-based perspective and proposed an efficient data structure called ImageWeb to discover the high-level relationships of images [54]. A tradeoff strategy is provided to guide the parameter selection in the online searching process. By sacrificing the initial search accuracy, better search performance is achieved with much lower time complexity compared to the baseline methods and their algorithm is highly scalable. Nemirovskiy and Stoyanov suggested a search pattern based on the rank distributions of the cluster cardinality [55]. Multi-step segmentation performed by means of the recurrent neural network created an image pattern, based on the rank distribution of the brightness cluster cardinalities. The recognition based on the rank distribution allows determining the ND images in grayscale up to the radius of the Gaussian distortion on them. Lei et al. used Radon transform in content-based image representation as it has excellent geometric properties [56]. A family of geometric invariant features based on Radon transform is proposed for near duplicate image detection. NDID methods achieve their goals by measuring the similarity between the features of the query and the target.

An image is represented by a signature and its length gets varied based on the number of patches in the image. Li et al. proposed a visual descriptor named probabilistic center-symmetric local binary pattern (PCSLBP) to depict the patch appearance [57]. Beyond each individual patch, the relationships among the patches are described. A weight is assigned to each patch to identify the image. Given the characteristics of all the patches, the image is represented by a signature. For similarity computation, earth mover’s distance is used as it handles variable-length signatures. The patch extraction instability is addressed by allowing many-to-many patch correspondence. Lingxi et al. focussed on fine-grained image search. Fine-grained image search resulted in images that contain the same fine-grained concept with the query [58]. The locality sensitive hashing (LSH) based methods will not generate the buckets of similar sizes, which will reduce the detection effectiveness and efficiency. Yabo et al. proposed load balanced LSH (LBLSH) to produce load-balanced buckets for the hashing process [59]. It reduces the query time and the storage space. Rituparna and Scott evaluated similarity between two images aided by a salient object detection framework [60]. The objective was achieved by incorporating a salient object detection scheme in a sparse representation-based dictionary learning framework. The key aspect in obtaining a more robust similarity measure between images is to extract more meaningful features from the image.

Kim et al. studied near-duplicate image discovery on one billion images which is easily implemented on MapReduce framework [61]. They introduced the seed growing step designed to effectively reduce the number of false positives among cluster seeds. NDI discovery is to detect all clusters composed of images which duplicate at significant regions. Haoran et al. proposed an invariant multi-scale shape descriptor for shape matching and object recognition [62]. This shape feature is invariant to translation, rotation, scaling and can tolerate partial occlusion, articulated variation and intra-class variations. The advantages, disadvantages and suitability of area-based features are tabulated in Table 3.

Table 3 Summary of Features in Area-based Feature Extraction Method

4 Methods Used in Near-Duplicate Retrieval of Images

ND image retrieval is to find all the images that are near duplicates to a particular query. A major challenge in the image retrieval is building effective features that are invariant to a wide range of variations. Near-duplicate retrieval is used for the management of multimedia contents. Search speed is a key factor to judge any near duplicate retrieval algorithm. Not only the retrieval time but also maintaining the relevance of returned results is important in image retrieval. The methods used in the retrieval of near-duplicate images are presented in this section.

Hu et al. proposed a coherent phrase model (CPM) which used the visual phrase of multiple descriptors to characterize every local region to enforce local coherency [63]. It is different from standard Bag of Words (BoW) representation. The two types of phrases proposed are Feature coherent phrase (FCP) and Spatial coherent phrase (SCP). In FCP, every local region is characterized by multiple descriptors of different types. The match of two local regions requires the coherency across different types of features. In SCP, multiple descriptors of a single type of feature are generated between every local region. The match of two regions requires the coherency across different spatial neighbourhoods. Although this model is simple, it improves the matching accuracy by reducing the false matches and additional cost existed for the extraction of multiple features.

The traditional visual vocabulary is created in an unsupervised way by clustering a large number of local features. This is not ideal because it largely ignored the semantic and spatial contexts between local features. Shiliang et al. combined geometric and semantic contexts to generate contextual visual vocabulary [64]. The contextual visual vocabulary is more expensive as the distance computation is time-consuming. Two or more features are combined with other features in the detection of near duplicates. This is known as bundling of features. Bundled features are much more discriminative than a single feature. Wu et al. improved the bundled features with an affine invariant geometric constraint for the retrieval of partial duplicate images [65]. It employs area ratio invariance property of affine transformation to create the affine invariant matrix for bundled visual words. Such affine invariant geometric constraints cope well with flip, rotation or other transformation. The two main strategies for IND retrieval are global feature-based retrieval and local feature-based retrieval. Global features such as color moment and color histogram, although efficient, are sensitive to occlusions and illumination changes. The local region features are robust to lighting, viewpoint and scale changes. However, such methods still suffer from heavy computational cost. Cheng et al. [66] explored local features in spatial-scale space (S-cube) and then formed a global representation for each image. Local features made the model robust to illumination variations, viewpoint and scale changes, as well as geometric transformations, while the final global representation allowed an efficient online retrieval. S-cube has a fast retrieval speed while a high accuracy is preserved. S-cube representation incorporated not only the appearance feature information but also the spatial and scale co-occurrence information. S-cube method is invariant to spatial information and a wider range of scale changes.

Tong et al. proposed a statistical framework for large-scale near-duplicate image retrieval using kernel density function [67]. Each image is represented by a kernel density function and the similarity between the query image and a database image is then estimated as the query likelihood. Zhou et al. [68] proposed a matching verification scheme based on BSIFT (binary SIFT) signature. Using the BSIFT, the precision of large-scale image search can be improved by identifying and removing the false-positive matches. Paradowski et al. addressed the problem of large-scale near-duplicate image retrieval (NDIR) and proposed a new spatial verification routine [69]. It incorporates neighbourhood consistency, term weighting and it is integrated into the Bhattacharyya coefficient. They give better retrieval quality. The state of the art of technology for NDIR is mostly based on the Bag-of-Visual Words model. However, visual words are straightforward to end in mismatches because of quantization errors of the local features. In order to improve the precision of visual words matching, contextual descriptors are designed to strengthen their discriminative power and measure the contextual similarity of visual words [70]. The contextual descriptor measured the contextual similarity of visual words to discard the mismatches and to reduce the count of candidate images. This descriptor increased the discriminative power of visual words and is robust to image editing operators, such as rotation, scaling and cropping. But it is not robust to perspective transformation of image. It does not work as well on general object retrieval.

Near-duplicate documents are the documents that are created from the original document with some changes. ND document images refer to the images captured from the same document but under different imaging conditions. Document images are the images/scanned copies of documents that may be handwritten or typed documents. In document images, NDD may be used to increase the efficiency of tagging the documents by reducing the need for manual inspection of documents. The entire document (including the text) is considered as image. Document image detection is used in postal automations and digital libraries. Some of the methods used for the detection of near-duplicate document images are presented in this section.

Shiv Vitaladevuni et al. presented an approach to detect near-duplicate document images using SIFT interest point matching [71]. From the set of document images, a database is constructed using the SIFT features extracted from each image. The near duplicates of a query are computed by directly matching the SIFT descriptors with the feature database. Interest point based matching is used for hand and machine written document images for the NDD in Arabic documents in which OCR and text segmentation is difficult. In this approach, 80% of the near duplicate documents are retrieved with low false accepts. Liu et al. proposed document image matching characterized by a graphical perspective [72]. Document images are represented by graphs whose nodes correspond to the objects in the images.

5 Near-Duplicate Detection in Image Forensics

Copy-move is one among the most common techniques used for image forgery. In this kind of forgery, one or additional objects in an image are hidden by copying a part and moving it to another place of the same image. Some advanced image editing tools make this type of forgery undetectable by applying a ‘soft’ touch at the edges of the moved part. As the color and texture of the moved part is compatible with those of the copied part, it is very difficult to distinguish between those two parts. Also, two or more identical objects in the same original image make the forgery detection difficult. In copy-move forgery detection, the best performing methods are based on the matching of highly discriminative local features [73]. The discriminability of local features is very less and also, the BOW quantization errors lead to many false local matches that made very difficult to differentiate similar images from copies. Geometric consistency verification reduced the false matches but it neglected the global context information of local features and thus this problem is not solved well. To address this problem, Zhou et al. proposed a global context verification scheme to filter false matches for copy detection [74]. First initial SIFT matches between images are obtained and then the overlapping region-based global context descriptor (OR-GCD) is used for verification of these matches to filter false matches. The OR-GCD has good robustness and efficiency.

Rimba et al. proposed a different kind of scheme by exploiting a group of similar images to verify the source of tampering [75]. By finding the correlation between the reference images and the suspicious image allowed discovering the source of the tampered region. This method relies on the presence of edges. So, the images with no edges are not detected. If the size between the target image and its references are different, then the correlation will be low. Also, if the tampered regions have undergone transformations, then the correlation values will be below the threshold. Irene Amerini et al. presented a SIFT-based forensic method for copy–move attack detection and transformation recovery but it does not investigate the cloned image patch [76]. Ghulam Muhammad et al. proposed a blind copy move image forgery detection method using undecimated dyadic wavelet transform (DyWT) and it is shift invariant [77]. The existing copy-move forgery detection methods either rely on similarity measurements or noise deviation measurements between the parts of an image. Vincent et al. created copy-move forgery by copying and pasting content within the same image [78]. Here, the detection performance is analyzed on the per-image basis and on the per-pixel basis.

I-Chang et al. proposed a forgery detection algorithm to recognize tampered inpainting images [79]. An inpainting image is constructed by filling the target area using the surrounding regions in the same image. Therefore, the key problem is recognizing the fake region and search for similar regions in the faked image. This approach recognized the forged image effectively and identified the forged regions even for the images that have uniform background. In addition to the inpainting type, copy-move forgeries are also detected by this method. But the limitation is in locating forged regions of small sizes. Also, if the forged region is from different sources this method may fail.

6 Clustering Methods Used for the NDD

The main aim of clustering is to improve search quality and save storage space. Clustering the web image search engines results is very essential to help users narrow their search. The clustered results are the refined results of the image search. These resultant images will be relevant to the search query. Clustering the near duplicates arising from the results of web image search mainly depends on image features. Image features help to uniquely identify the image from the large dataset. While image search engines return a long list of images, the user may find it difficult to choose his or her interested areas. Generally, the image search results contain multiple topics. Organizing the results into different clusters facilitates users’ browsing. Foo and Sinha [11] clustered near duplicate images by combining techniques from invariant image local descriptors and an adaptation of near duplicate text-document clustering techniques. The geometric clue among visual words in an image is computationally expensive and are not considered for clustering.

Some of the online resources such as news and weblogs are used to extract articles and comments are posted related to a popular event. If articles have common parts, then the content of such article is event-relevant leading to near duplicates. Chang et al. proposed an NDD method for finding event-relevant content on the web [81]. The near duplicates related to the same event are checked based on the compact feature of the document. Based on the features, the documents are clustered into event relevant groups. Story clustering is a critical step for news retrieval, topic tracking and summarization. With the overwhelming volume of news videos available today, it becomes necessary to track the development of news stories from different channels, mine their dependencies and organize them in a semantic way. The mining of topic-related stories through clustering on the basis of visual constraints is built on top of text by Wu et al. [82]. Two stories sharing at least one pair of NDK are always placed into the same cluster. Constant-driven co-clustering algorithm is employed for mining news topics of varying densities, shapes and sizes. Constraints constructed the association of stories and keyframes; and initialized cluster centers. Also, the retrieval radius is dynamically adjusted to density-reachable stories, which reduce the burden of parameter selection and provide important clues for grouping stories. Constraints may also be outliers and noisy due to the errors in automatic NDK detection.

The ASIFT proposed by Morel et al. viewed the images in different angles by varying the two camera axis orientation parameters, namely, the latitude and the longitude angles, which cannot be done using the SIFT method [83]. This can be done with no computational load. ASIFT outperforms SIFT, maximally stable extremal region (MSER) method, Harris-affine, and Hessian-affine. To reduce the computational cost in partial-duplicate web image search, image features are bundled into local groups. Wu et al. proposed a bundled approach to find the near duplicate images [84]. The SIFT and MSER features were bundled. MSER is a region-based approach, while SIFT is a point feature detection approach. Each group of bundled features becomes more discriminative than a single feature. Kalaiarasi and Thyagharajan clustered the images by bundling the ASIFT and wavelet features of the images [85]. The input to the framework is the results retrieved from the web image search engine. The bundled feature matrix has the relationship among the images with respect to the ASIFT and wavelet feature. Clustering is done using the bundled feature matrix and k-means clustering. Current image search systems use paged image list to display search results. But, the query ambiguity is hard to find search targets in such image lists. The clustering algorithm developed by Ponitz Thomas and Julian Stottinger has a linear runtime and can be carried out in parallel [86]. Features of images with natural repetitive texture become similar to other images and displayed in almost all the search results. This problem is addressed by an asymmetric Hamming distance measurement for bags of visual words. It allows better discrimination power and robust to image transformations such as rotation, cropping, or change of resolution and size.

Zha et al. proposed a group recommendation framework for social groups to share photos [87]. In that framework, pre-built group classifiers predict the group of each photo present in the user’s photo collection. Also, representative photos are selected from the collection and accordingly grouping of photos is done. Kalaiarasi and Thyagharajan detected and clustered the ND images based on the visual content present. The supervised and unsupervised clustering is used to form cluster of image [88]. Each cluster would have one image as a representative of that cluster and the other images are called its near duplicates. Clustering of image search results and selection of representative images on large scale image data is done [89]. The system developed by them can organize image search results and help users browse images and find search target. This system applies MapReduce-based image graph construction method and affinity propagation algorithm to generate image clusters and representative images for large-scale image data. This approach does not combine visual and textual image graph. This image grouping system results in the images belonging to multiple clusters. It is an unresolved problem in this work. Partial duplicate image discovery/clustering is defined as finding all the images containing the same objects from a large dataset. The challenge of partial duplicate web image discovery/clustering is that the target images may contain many variations, such as color, scale, illumination; as well as viewpoint changes, partial occlusion, near-duplicate and affine transformations. Zhang and Qiu improved the performance of large scale partial duplicate image discovery and clustering by encoding the spatial geometric information of bag of visual words (BOW) [90].

7 Data Bases Used for NDD Evaluation

The widely used databases to evaluate the near-duplicate detection of images are discussed in this section.

Corel Photo CD Collection: [91]

Each image under this set is altered using the following list of transformations – format change (1), colorsize (3), contrast (2), severe contrast (2), crop (4), severe crop (3), despeckle (1), frame (4), rotate (3), scale-up (3), scale-down (3), saturation (5), intensity (4), severe intensity (2), rotate + crop (3), rotate + scale (3), shear (4). This results in 50 alterations [11, 80].

MIRFlickr1 M and MIRFlickr60K: [92]

MIRFlickr1M has 1 million distractor images mostly related to holidays of humans. It also includes low resolution images. MIRFlickr60K has 67,714 images.

MIRFLICKR-1M and 2008 - MIRFLICKR25000 [93]: MIRFlickr25000 has 25000 images and MIRFlickr1M has one million distractor images which are used to evaluate large scale image search. Compared to INRIA, the Flickr datasets are slightly biased, because they include low resolution images and more photos of humans.

Oxford5K Dataset: [94]

The dataset has 5062 images of Oxford buildings collected from Flickr by searching Oxford landmarks. [45] introduce a novel quantization method based on randomized trees and evaluate it using Oxford dataset. They show that the quality of image retrieval depends on the quantization.

MM270K Dataset: [95]

This set contains about 18,000 images with very diverse content, such as animals, landscapes, people, etc. These images have undergone 40 transforms ([9, 63]).

Columbia NDI Database: [96]

The database contains much greater variation in spatial translations and scale variations. There are 150 ND pairs (300 images) and 300 ND images [18, 38, 63, 98].

CityU Dataset: [97]

The set consists of 29 news topic covering 805 stories and resulting in 7006 shots [38, 63].

NTU Dataset: [99]

There are 150 ND pairs (300 images) and 300 ND images [63].

UKBench Dataset: [100]

The dataset contains 10,200 images where every four images capture a single object with the different viewpoint, illumination conditions and scale. There is a total of 2550 objects such as CD covers, etc. [102].

INRIA Dataset: [101]

The dataset contains 1491 images in 500 groups with a variety of scene types such as natural, man-made, water, sky, etc. These images were modified by applying different types of transformations like image rotation, image resizing, JPEG compression ranging from JPEG3 to JPEG 75, image cropping ranging from 5% to 80% of the image surface and by applying strong transformations such as print and scan, paint, change in contrast, perspective effect, blur, very strong crop and so on [53].

California ND Dataset: [ 103]

This dataset has 701 photos of a user’s personal photo collection. Many challenging non-identical near-duplicates of real-world scenes are provided in this collection. Detecting NDs is a first step required in managing photo collections. This dataset helps the researchers to test their NDD algorithms with real photos [2]. The first 604 photos were consecutively selected from the beginning of the collection. The remaining 97 photos consist of pairs of interesting cases from the same collection that is not present among the first 604 photos. The images undergone transformations such as burst shots (344 photos), moving background shots (149 photos), show/performance shots (54 photos), group photos (17 photos), panorama shots (8 photos), exposure/brightness difference (58 photos), viewpoint difference/zooming (36 photos), focus change (22 photos), white balance difference (8 photos). The case of burst, performance and panorama shots have not been considered in any other datasets.

COVERAGE Database: [104]

The COVERAGE (COpy-moVeforgERydAtabase with similar but GenuineobjEcts) contains 100 original images and 100 forged images [3]. In this database, the forged/near-duplicate images have been obtained by performing transformations such as Translation, Scaling, Rotation, Free-Form, Illumination and Combination. Out of 100 near-duplicate images, 16 images fall under translation transformation, 16 images under scaling, 16 images under rotation, 16 images under free-form, 16 images under illumination and 20 images under combination transformation [3].

CoMoFoD Database: [105]

The CoMoFoD (Copy-Move Forgery Detection) has 200 small images. In this database, the forged/near-duplicate images have been obtained by performing transformations such as Translation, Scaling, Rotation, Free-Form, Illumination and Combination [106]. Also subjected to the following post-processing methods to derive many more duplicate images and to make the forgery perfect - Brightness Change (3), Contrast adjustment (3), Color reduction (3), Image blurring (3), JPEG compression (9), Noise addition (3). These 24 transformations are done for both the original and forged image resulting in 48 transformations [107].

CASIA Database: [107]

The CASIA V1.0 dataset has 800 authentic images mostly collected from Corel image set and it also has 921 spliced color images in JPEG format. Image splicing is the process of cutting certain regions of one image and pasting them on the same or other images without performing post-processing. The CASIA V2.0 database consists of 7491 authentic more realistic images and 5123 tampered color images with post-processed boundaries of spliced regions.

MICC-F220 and MICC-F2000: [108]

Irene Amerini [76] created MICC-F220 a database with 220 images and MICC-F2000 a database with 2000 images. MICC-F220 consists of 110 original images and 110 tampered images and the image resolution varies from 722 × 480 to 800 × 600. The average size of the forged patch is 1.2% of the image size. MICC-F2000 has 1300 original images and 700 tampered images with the resolution of 2048 × 1536 pixels. MICC-F600 dataset created by the same authors has 440 original images, 160 tampered images and 160 ground truth images.

The CoMoFoD, Manipulation (Manip) and GRIP datasets [3] provide both the forged and original images. Among these, only CoMoFoD considers complex manipulations for forged image analysis. From these discussions, the CoMoFoD database is found to be more suitable for justifying the near-duplicate detection method as there are various complex transformations. If the proposed method detects near duplicate images properly in this database, then that method will be more appropriate and best suited for all types of duplication/forgeries detection. The performance of various feature extraction methods tested with different near-duplicate datasets is compared in Table 4.

Table 4 Performance comparison of features extraction methods on different datasets

8 Applications of Near-Duplicate Detection of Images

Some of the applications developed by the researchers based on NDD are given below

  1. 1.

    Identifying and eliminating ND spam reviews, thereby providing a summary of the trusted reviews for customers to make buying decisions [111]

  2. 2.

    Fabricated Picture Detection [112]

  3. 3.

    Zhang et al. presented a face annotation system to collect and label celebrity faces automatically from the web [113]

  4. 4.

    Face matching system for post-disaster family reunification [114]

  5. 5.

    Logo recognition technique based on feature bundling [115]

  6. 6.

    Logo search is required in many real-world applications [116]

  7. 7.

    Building a digital library and postal automation [57]

  8. 8.

    Hand gesture recognition [63]

  9. 9.

    Resource utilization and traffic alleviation in many network architectures and leveraging in-network storage for various content-centric sessions [117]

  10. 10.

    Visual media intelligence in detecting similar visual content published on the web in a short period of time [118]

9 Challenges and Future Research Directions

A major challenge for the NDD is the runtime performance. If the runtime is more, then it leads to a huge waste of computational resources. The detection efficiency of traditional approaches is not satisfactory because the peculiarity of the NDD is very huge and some “hot spot” images have too many duplicates while some have very few. On the whole, the present systems fail to produce good results when the duplicates are created by large rotation, large scaling, many simultaneous image manipulations, change in the viewpoint and transformations in image tampering. These systems also suffer from the huge waste of network resources, high computational complexity, high false identification and low accuracy. The main challenge in this area is to find out the original image from the set of similar images. Also, due to the presence of NDs, the search time increases in web search engines. Even though many researchers worked on the scalability and reliability issue, still those issues should be addressed and improved a lot.

10 Conclusion

The search engines face a problem leading to the fact that the search results are of less relevance to the user due to the presence of near-duplicate images. Since the digital content is widespread and also easily redistributable, the presence of near duplicates affects the performance of the search engines critically. In this paper, the state-of-the-art approaches of detecting near-duplicate images and their applications are reviewed along with the methodologies used. The main challenges in this field are discussed. This review provides research directions to the fellow researchers who are interested to work in this field. Also, the survey concentrated only on the feature extraction task in computer vision system. There are many more tasks and techniques to do research.