1 Introduction

Near duplicate image detection is a key component of Internet-scale and other large-scale image processing and analysis. For visual search engines, duplicated images can be treated as image spam [22] or, conversely, regarded as useful data to be mined [24]. More interestingly, copyrighted or offensive content and spam can be filtered out, and images of the same category can be found, by searching for near duplicates of images in a manually pre-confirmed dataset [14, 25].

Finding near duplicate images among millions to billions of images is very challenging, as in web-scale image search; however, the problem differs from general image search, which finds approximate nearest neighbors. First, near duplicate image detection must answer a true-or-false question by setting a threshold that controls the trade-off between true and false positives: false positives eliminate valuable images, while false negatives leave identical images insufficiently removed. Second, the detection process typically acts as a pre-filter or underlying sub-module of larger multimedia applications, which themselves require most of the available resources. The image description should therefore be compact for storage and access efficiency, and duplicate detection needs to be very fast and accurate with light memory usage. In addition, it is desirable that the technical solution be easily adapted to general image search problems [2].

Early near duplicate image detectors were based on hashing approaches adapted from document processing. While image search engines retrieve similar or relevant images in ranked order by measuring distances in a high-dimensional feature space, near duplicate image detectors have been developed to find perceptually/visually identical images by counting hash key collisions between binary codes. Ke et al. [14] introduced a near duplicate detection system that employs locality-sensitive hashing (LSH) to index local descriptors, but it was applied only to a small dataset. Chum et al. [5, 6] adopted the min-Hash algorithm for local descriptors instead of LSH and extended the scalability by improving efficiency and accuracy. Such local descriptors are robust against viewpoint changes [6] and partial occlusions [12], but their computational cost is very high.

Others represent images by global features, called signatures or fingerprints in the field of content-based image retrieval, and this approach has been found to scale to millions or billions of images. Zhang et al. [28] represented an image by the average intensities on a regular grid, whose concatenation is defined as a global image feature. The feature vector is projected by PCA and quantized into a hash code. Their experimental results are shown on 2.5 million images, and a similar algorithm was reported to work on billions of images in an image-based annotation system [24]. Douze et al. [7] demonstrated that the Gist descriptor, which was originally proposed as a computational model of the recognition of real-world scenes [19], is suitable for large-scale near duplicate detection. Their method outperformed the state of the art, i.e., bag-of-features search based on local features, in terms of accuracy, memory efficiency, and search speed. Within the bag-of-features image search framework, they quantized the Gist features by k-means clustering, and the residual vectors were embedded into Hamming space. Their efficient indexing structure for the Gist descriptor achieved accuracy comparable to exhaustive search.

In the field of near-duplicate image detection, other visual descriptors have also been proposed as alternatives to the Gist descriptor. The authors of [17] proposed a variable-length signature and used the earth mover's distance to compare such signatures. Furthermore, in [29], an efficient coarse-to-fine Riemannian image search strategy was developed to improve efficiency while maintaining accuracy.

Recently, learning-based transform coding has attracted attention for approximate nearest neighbor search because it represents images or high-dimensional feature vectors as compact binary codes while preserving their neighborhood structure and/or relative distances. Torralba et al. [20, 21] employed supervised machine learning methods based on boosting and Restricted Boltzmann Machines (RBMs) to compress the Gist descriptor into a few hundred bits. Other approaches rely on unsupervised or semi-supervised machine learning techniques. These methods mostly depend on PCA for dimensionality reduction and then encode the projected vectors into binary codes using various quantization techniques [3, 8, 13, 23, 26]. In [26], the spectral hashing (SH) method assigns more bits to more relevant directions, and in [23], the semi-supervised hashing (SSH) algorithm relaxes the orthogonality constraints of PCA. Jegou et al. [13] showed that applying a random orthogonal transformation to the PCA-projected data works better than SH and SSH. Gong and Lazebnik [8] improved on the previous methods by applying an orthogonal transformation that directly minimizes the quantization error, called iterative quantization (PCA-ITQ). Furthermore, the relationship between nearest neighbor search and signal quantization was investigated by Chandrasekhar et al. [4] and Brandt [3].

Most previous PCA hashing methods applied to image descriptors focus on the similarity preserving property and on minimizing the coding error, but they pay little attention to optimizing performance in terms of the receiver operating characteristic (ROC) curve or the precision-recall curve. Near-duplicate image detection is an important application belonging to this category. For example, in an e-commerce service, near-duplicate products should be hidden or grouped in the search results. For content-based image search, near-duplicate images should be grouped with high precision in order to extract shared keywords from the surrounding text. Another example is copyright protection, where all near-duplicate images other than the original copyrighted image should be prevented from appearing in public or in search results. In this paper, we present a fast and efficient large-scale system for near duplicate image detection based on PCA hashing. While other transform coding methods target nearest neighbor search for similar image retrieval, we introduce an effective and efficient range search method for the near duplicate detection problem by revisiting Gist-PCA hashing. Our contributions are threefold:

  • We argue that a PCA binary coding of the scene gist is well suited to near duplicate image detection, because a PCA binary embedding can keep as much distance information as the PCA transformation itself does (Section 2).

  • We propose a scalable near duplicate image detection method that decomposes a Gist-PCA binary code into a hash key and a residual binary code for large-scale search (Section 3).

  • A multi-block approach, i.e., encoding cropped image regions as well as the holistic image region, is further incorporated to deal effectively with strong variations such as image cropping and border framing (Section 4).

The paper is organized as follows. Section 2 explains how to encode global image characteristics with Gist-PCA hashing. Sections 3 and 4 introduce the bit decomposition and the multi-block approach, for large-scale retrieval and for accuracy improvement, respectively. Experimental results are presented in Section 5, and we conclude in Section 6.

2 Binary Image Representation for Near Duplicate Detection

Near duplicate images are herein defined as images related by image-to-image transformations, including various photometric and geometric deformations such as contrast change, JPEG compression, resizing, cropping, border framing, and watermarking. In other words, they are regarded as perceptually identical images rather than images taken of identical objects/scenes [5, 7]. Examples of near duplicate image variations are presented in Fig. 1, and more details are given in Section 5.

Figure 1. Simulated image variations for the near duplicates and their Hamming distances from the original image. The first row shows the original image and its near duplicate variations. The second row gives the name of each near duplicate variation and its Hamming distance for the 128-bit binary code.

To encode the perceptual similarity of images, the Gist descriptor is used. It was originally proposed to capture a set of perceptual dimensions of visual scenes [19], and it has recently shown promising results for image search [7]. The Gist descriptor is extracted by concatenating 20 Gabor filter responses over a 4x4 grid of non-overlapping image blocks for each color channel, yielding a 960-dimensional real-valued vector (20 x 16 x 3) for color images. The visual similarity between different images can then be measured by the Euclidean (L2) distance in the Gist vector space.

Given a query image Iq, our goal is to search for its near duplicate images in the database \(\mathcal{I}_{ref} = \{I_{1}, \cdots, I_{N}\}\). The corresponding Gist descriptors are represented by \(\mathbf{x}_{q} \in \mathbb{R}^{n}\) and \(\mathcal{X} = \{\mathbf{x}_{1}, \cdots, \mathbf{x}_{N}\} \subset \mathbb{R}^{n}\), respectively, where n is the vector dimension. The image similarity between the query xq and a database image \(\mathbf{x}_{i} \in \mathcal{X}\) is measured by the L2 distance, i.e., \(d(\mathbf{x}_{q}, \mathbf{x}_{i}) = \|\mathbf{x}_{q} - \mathbf{x}_{i}\|_{2}\). The set of near duplicate images \(\mathcal{X}_{q}\) is determined by checking whether the Gist descriptor of a database image lies within a certain distance, dGist, of the query vector, i.e., \(\mathcal{X}_{q} = \{\mathbf{x}_{i} \mid d(\mathbf{x}_{q}, \mathbf{x}_{i}) \leq d_{Gist}, \mathbf{x}_{i} \in \mathcal{X}\}\), and the other database images, i.e., the complements in \(\mathcal{I}_{ref}\) outside the range, are non-duplicate images: \(\mathcal{X}_{q}^{\complement}\).
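For concreteness, a brute-force version of this range query over the real-valued Gist vectors can be written as the following sketch (the function and variable names are ours and purely illustrative):

    import numpy as np

    def range_query(x_q, X, d_gist):
        """Return indices of database Gist vectors within L2 distance d_gist of the query.

        x_q    : (n,) query Gist vector
        X      : (N, n) matrix of database Gist vectors
        d_gist : distance threshold separating near duplicates from non-duplicates
        """
        dists = np.linalg.norm(X - x_q, axis=1)   # L2 distance to every database vector
        return np.flatnonzero(dists <= d_gist)    # indices of the near-duplicate set X_q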

The determination of the distance threshold dGist is critical for the performance of near duplicate detection. When the threshold is too large, non-duplicate images are misclassified as near duplicates, and when it is too small, many near duplicate images are missed. The threshold can be fixed for all queries or adapted to each query. For near duplicate image detection, we found that the advantage of an adaptive threshold is small after binary quantization, and it is not easy to determine an adaptive threshold for a given query. A fixed threshold is therefore used in this paper; it can be determined in the training stage once the allowed false acceptance rate is given.

2.1 Binary Coding

To deal with millions to billions of images, real-valued high-dimensional descriptors are too large to store and too expensive to compare instantly, so compression of the descriptor is crucial. For compact binary coding, whereas other (approximate) nearest neighbor methods only maintain the local neighborhood structure, we seek a binary embedding that keeps as much distance information as possible, so that a range query can discriminate near-duplicate from non-duplicate images. In other words, when near duplicate images lie within a certain range in the original vector space, they should also be found within a certain range in the quantized space.

An m-bit binary coder, \(h: \mathbb{R}^{n} \rightarrow \{0, 1\}^{m}\), of an n-dimensional vector is designed to retain the original distance structure as much as possible. In the transform coding framework, the binary coding is broken into two steps. First, the Gist descriptor (\(\mathbf{x} \in \mathcal{X}\)) is transformed into a low-dimensional vector space, \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{k}\); then the vector is encoded into a binary code by \(b: \mathbb{R}^{k} \rightarrow \{0, 1\}^{m}\). The first step is the transform coding (f(⋅)) and the second is the binary quantization (b(⋅)).

Let h(xq) and h(xi) be the m-bit binary codes corresponding to the Gist vector of a query image xq and that of a database image \(\mathbf{x}_{i} \in \mathcal{X}\), respectively. We model the Gist vectors xq and xi as random vectors drawn from an underlying unknown distribution. The sets of binary codes belonging to the ground-truth near duplicate images and to the remaining images are denoted by \(\mathcal{B}_{q}\) and \(\mathcal{B}_{q}^{\complement}\) for the query h(xq), where h(xi) ∈ \(\mathcal{Y} \equiv \{0, 1\}^{m}\) and \(\mathcal{B} = \mathcal{B}_{q} \cup \mathcal{B}_{q}^{\complement}\). A desirable binary coding therefore needs to maximize the range preserving probability:

$$ \sum\limits_{i = 1}^{N} {P\left( h (\mathbf{x}_{i}) \!\in\! \mathcal{B}_{q} | \mathbf{x_{i}} \!\in\! \mathcal{X}_{q}, \mathbf{x}_{q}\right) \,-\, \lambda P\left( h (\mathbf{x}_{i}) \!\in\! \mathcal{B}_{q} | \mathbf{x_{i}} \!\in\! \mathcal{X}_{q}^{\complement}, \mathbf{x}_{q} \right)}, $$
(1)

for any query xq, where λ is a weighting factor. The binary code should maximize the probability that true duplicates fall within a certain range while penalizing the probability that non-duplicates fall inside the same range. In terms of ROC criteria, the first and second terms correspond to a high detection rate and a low false acceptance rate, respectively. While most previous hashing methods applied to image descriptors focus on the similarity preserving property and coding error minimization, we optimize performance in terms of the receiver operating characteristic (ROC) curve through Eq. 1.
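As an illustration only, the two terms of Eq. 1 can be estimated empirically from labeled training pairs with a short sketch such as the following (our own naming, assuming codes are stored as 0/1 arrays):

    import numpy as np

    def empirical_roc_terms(code_q, codes_dup, codes_nondup, radius):
        """Estimate the two terms of Eq. 1 for one query at a given Hamming radius.

        code_q       : (m,) binary code of the query, values in {0, 1}
        codes_dup    : (N1, m) codes of the ground-truth near duplicates of the query
        codes_nondup : (N2, m) codes of the non-duplicates
        radius       : Hamming ball size defining the set B_q around the query code
        """
        d_dup = np.count_nonzero(codes_dup != code_q, axis=1)     # Hamming distances
        d_non = np.count_nonzero(codes_nondup != code_q, axis=1)
        detection_rate = np.mean(d_dup <= radius)   # first term: true duplicates inside the ball
        false_accept = np.mean(d_non <= radius)     # second term: non-duplicates inside the ball
        return detection_rate, false_accept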

2.2 Gist-PCA Hashing

To find a binary coding that maximizes the range preserving probability (Eq. 1), PCA hashing is investigated. Solving Eq. 1 directly is intractable because of the combinatorial nature of binary codes. In this paper, we describe why PCA hashing is a better choice than other state-of-the-art hashing methods with a similar preserving property, supported by both a comparison study and a theoretical analysis of the performance of PCAH relative to the other methods.

PCA is a linear transformation that chooses coordinate axes so as to retain as much distance information as possible in the mean-square sense [27]. The principal axes of PCA identify the directions with the largest variations, and these variations affect the L2 distance (i.e., the mean square distance) most [3]. PCA embeds an n-dimensional Gist descriptor xi into a k-dimensional vector space (k ≪ n). Given a training set, the Gist vectors are mean centered by subtracting the data mean, denoted by \(\bar{\mathbf{x}}\), and concatenated into an n × N matrix. From its covariance matrix, the eigenvectors with the k largest eigenvalues form the transform coordinates, denoted by W = {w1, ⋯, wk}.

Using the learned transform matrix W, PCA hashing is straightforward. First, a mean-centered Gist vector \(\tilde{\mathbf{x}}_{i} = \mathbf{x}_{i} - \bar{\mathbf{x}} \in \mathbb{R}^{n}\) is projected into a low-dimensional vector space by PCA, \(f(\tilde{\mathbf{x}}_{i}) = \left(f_{1}(\tilde{\mathbf{x}}_{i}), \cdots, f_{k}(\tilde{\mathbf{x}}_{i})\right) = \left(\mathbf{w}_{1}^{T}\tilde{\mathbf{x}}_{i}, \cdots, \mathbf{w}_{k}^{T}\tilde{\mathbf{x}}_{i}\right)\); then each dimension is quantized into a binary bit, b(yj) = Q(yj), where \(y_{j} = f_{j}(\tilde{\mathbf{x}}_{i})\) and Q(yj) = sgn(yj). Therefore, the binary coding is represented by

$$\begin{array}{@{}rcl@{}} \mathbf{z}_{PCAH} &=& h({\mathbf{x}_{i}}) = b \circ f(\tilde{\mathbf{x}_{i}})\\ &=& \left( Q(\mathbf{w}_{1}^{T} (\mathbf{x}_{i} - \bar{\mathbf{x}})), \cdots, Q(\mathbf{w}_{k}^{T} (\mathbf{x}_{i} - \bar{\mathbf{x}})) \right), \end{array} $$
(2)

for any gist vector xi.
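For clarity, a minimal NumPy sketch of these two steps, learning W from training Gist vectors and then encoding via Eq. 2, is given below; the helper names (train_pca_hash, pca_hash) are ours and do not refer to any released implementation:

    import numpy as np

    def train_pca_hash(X_train, k):
        """Learn the data mean and the k principal axes from training Gist vectors.

        X_train : (N, n) matrix of training Gist descriptors, one per row
        k       : number of bits, i.e., principal axes to keep
        """
        mean = X_train.mean(axis=0)
        Xc = X_train - mean                             # mean centering
        cov = np.cov(Xc, rowvar=False)                  # (n, n) covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
        top = np.argsort(eigvals)[::-1][:k]             # indices of the k largest eigenvalues
        return mean, eigvecs[:, top]                    # W has shape (n, k)

    def pca_hash(x, mean, W):
        """Encode a Gist vector into a k-bit binary code (Eq. 2): sign of the PCA projection."""
        y = (x - mean) @ W                              # PCA projection f, shape (k,)
        return (y >= 0).astype(np.uint8)                # quantizer Q = sgn, mapped to {0, 1}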

This approach is deceptively simple, but it turns out to be better than other state-of-the-art methods in terms of the range preserving probability in Eq. 1. The binary coder in Eq. 2 is only PCA followed by a binary quantizer with a hard threshold. From the hashing perspective, the Gist vector space is partitioned by the hyperplanes orthogonal to the PCA principal axes wj, for j = 1, ⋯, k, which carry the largest variations.

We assume that maximizing the range preserving probability requires preserving the L2 distance metric of the original feature space as faithfully as possible after binary hashing. The problem is then how to represent the feature vectors with a small number of binary bits. First, PCA finds the most informative axes: the axis corresponding to the largest eigenvalue preserves the L2 distance most faithfully because it contains the largest variation. Second, PCA minimizes the reconstruction error between the original feature and its projection because it retains the maximum variation.

State-of-the-art methods

Before comparing PCAH with other state-of-the-art methods, including SH [26], LSH [11], and ITQ [8], we examine the major differences between them. Given an n-dimensional Gist feature x, the feature is first translated to zero mean (\(\mathbf{y} = \mathbf{x} - \bar{\mathbf{x}}\)). LSH projects y onto d random axes, producing a d-dimensional feature \(\mathbf{y}_{LSH} = \mathbf{y} A_{LSH}\), and quantizes each dimension to obtain the final hash code (\(\mathbf{z}_{LSH} = \mathrm{sgn}(\mathbf{y}_{LSH})\)). PCAH, ITQ, and SH transform y with PCA to obtain \(\mathbf{y}^{\prime}\). PCAH directly quantizes each component of \(\mathbf{y}^{\prime}\) by thresholding at zero (\(\mathbf{z}_{PCAH} = \mathrm{sgn}(\mathbf{y}^{\prime})\)). ITQ rotates \(\mathbf{y}^{\prime}\) by an orthogonal matrix that minimizes the quantization error and then thresholds at zero (\(\mathbf{z}_{ITQ} = \mathrm{sgn}(\mathbf{y}^{\prime} A_{ITQ})\)). SH produces d eigenfunctions that apply equally spaced thresholds to each component of \(\mathbf{y}^{\prime}\), resulting in dn components in total, and keeps the d components with the smallest eigenvalues. Therefore, in SH each component of \(\mathbf{y}^{\prime}\) can be assigned a varying number of bits, and each bit can be quantized with many thresholds (note that PCAH, ITQ, and LSH have only one threshold at zero).
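To make the encoding-step differences concrete, the following hedged sketch contrasts the LSH and ITQ-style encoders with the pca_hash function sketched above; the random matrix A_lsh and the orthogonal rotation R are assumed to be given (R would be learned by the iterative procedure of [8], which is not reproduced here):

    import numpy as np

    def lsh_hash(x, mean, A_lsh):
        """LSH: sign of random projections of the zero-mean feature (A_lsh: (n, d) Gaussian matrix)."""
        return ((x - mean) @ A_lsh >= 0).astype(np.uint8)

    def itq_style_hash(x, mean, W, R):
        """ITQ-style code: PCA projection, learned orthogonal rotation R (k x k), then sign."""
        y = (x - mean) @ W                              # same PCA step as PCAH
        return (y @ R >= 0).astype(np.uint8)            # only the rotation differs from PCAH

SH is omitted from this sketch because it assigns a varying number of bits and multiple thresholds per component, which does not fit the single sign-threshold template above.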

2.3 Comparison Study

We experiment on Copydays [13], the most popular benchmark for near-duplicate detection; the Peekaboom dataset is used for training the hash functions. Figure 2a shows ROC curves for the state-of-the-art methods and the proposed method. As shown in Fig. 2b, PCA hashing has a wider range of acceptable thresholds than ITQ and LSH. On the other hand, Fig. 2c shows that PCA hashing has a higher detection rate than SH for a fixed threshold. These observations are also backed up theoretically below, where we analyze the performance of PCA hashing in terms of the ROC curve.

Figure 2. The comparison of ROC, FAR, and DR for the compared methods.

Figure 3 also shows visual results for the compared methods. The correctly retrieved images, the correctly rejected images, and the falsely detected images are marked in cyan, yellow, and magenta, respectively. PCA hashing has better discriminative power for separating the true positive (correctly retrieved) and true negative (correctly rejected) images.

Figure 3. The example of the distance statistics of near-duplicate images from an original image.

The first step of the analysis shows that, for a fixed low false alarm rate, the range of Hamming ball sizes (thresholds) for which PCA meets the FAR constraint is larger than for the other methods. In this step, we show that PCA has a smaller variance of the Hamming distance between arbitrary queries, which results in a larger range of permitted thresholds.

The Relationship between FAR and Its Threshold

For arbitrary images A and B, the false alarm rate for a threshold γ can be written as

$$\begin{array}{@{}rcl@{}} FAR=P(d_{H}(A,B)\le \gamma \vert L(A)\ne L(B))\\=P(d_{H}(q_{1},q_{2})\le \gamma), \end{array} $$
(3)

where q1 and q2 are the binary hash codes of arbitrary but distinct query images.

Now we focus on the maximum threshold \(\gamma_{\max}\) allowable for a fixed low false alarm rate for each hashing method. A typical FAR is in the range 0.01–0.1, which is much smaller than 0.5. To compare the \(\gamma_{\max}\) of the hashing methods, we analyze dH(q1, q2), the Hamming distance between arbitrary queries q1 and q2. The key property of dH(q1, q2) is that its mean is fixed at d/2 for any hashing method that assigns 0 and 1 with equal probability (i.e., mean-centered bits): \(E[d_{H}(\mathbf{q}_{1},\mathbf{q}_{2})] = E[\sum_{n=1}^{d} d_{H}(\mathbf{q}_{1_{n}},\mathbf{q}_{2_{n}})] = \sum_{n=1}^{d} E[d_{H}(\mathbf{q}_{1_{n}},\mathbf{q}_{2_{n}})] = \sum_{n=1}^{d} 0.5 = \frac{d}{2}\). Most hashing methods, including PCAH, SH, LSH, and ITQ, adopt a mean-centering process (translating the data points so that they have zero mean), which results in equal probabilities of 0 and 1 for every bit. Therefore, for the methods considered here, we can assume that the mean of dH(q1, q2) is fixed at d/2. Next, we consider Var(dH), the variance of dH(q1, q2), for each method. If the variance is large, as in the case of ITQ in Table 1, then the \(\gamma_{\max}\) such that P(dH(q1, q2) ≤ γ) = FAR will be small. This observation reduces our goal to showing that PCAH has the minimum variance and that the other methods have larger variances, which strongly supports the claim that PCAH has the largest \(\gamma_{\max}\) for a given low false alarm rate.
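The fixed mean of d/2 and the role of Var(dH) can be illustrated with a small Monte Carlo sketch on synthetic zero-mean data with a PCA-like variance spectrum; this is only an illustration of the argument, not the experiment behind Table 1:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, N = 64, 32, 5000                          # feature dim, code length, sample count

    # Synthetic zero-mean data with a strongly decaying, PCA-like variance spectrum.
    scales = np.geomspace(10.0, 0.1, n)
    X = rng.normal(size=(N, n)) * scales

    codes_pcah = (X[:, :d] >= 0).astype(np.uint8)               # top-variance axes, sign-quantized
    R = np.linalg.qr(rng.normal(size=(d, d)))[0]                # a random orthogonal rotation
    codes_rot = ((X[:, :d] @ R) >= 0).astype(np.uint8)          # rotated before sign quantization

    def hamming_stats(codes, n_pairs=20000):
        """Mean and variance of the Hamming distance between random pairs of codes."""
        i = rng.integers(0, len(codes), n_pairs)
        j = rng.integers(0, len(codes), n_pairs)
        dh = np.count_nonzero(codes[i] != codes[j], axis=1)
        return dh.mean(), dh.var()

    print("PCAH    mean/var:", hamming_stats(codes_pcah))   # mean close to d/2
    print("rotated mean/var:", hamming_stats(codes_rot))    # mean also d/2; compare the variances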

Table 1 The comparison of the variance of the state-of-the-art methods.

3 Binary Code Decomposition for Scalability

The PCA-hashed binary code of the Gist image descriptor is decomposed into a hash key and a residual binary code, for fast neighbor search and compact storage, respectively. The most significant bits of the binary code are used as a hash key for coarse vector quantization, and the retrieved codes are then measured finely by the Hamming distance over the residual bits. The PCA transform coding followed by quantization separates near duplicate images from other visually similar non-duplicate images more effectively than other sophisticated coding techniques [3, 8], and even than the uncompressed real-valued Gist descriptor. We also show that it is closely related to the bag-of-features image search architecture with Hamming embedding [7].

To retain enough information, the PCA-based binary code should be at least 64 bits long, which captures 78.9% of the energy of the data; a 128-bit code, capturing 87.4%, is desirable for practical applications. Exhaustive search is too expensive on large-scale datasets, and a hash table over the full code would be too large to fit in memory: in [21], 30 bits was a practical maximum for a moderate amount of PC memory (\(2^{30}\) × 8 bytes per table entry = 8 GB) on a single machine. To handle longer binary codes, we decompose the binary code into a hash key and a residual binary code. The k-bit binary code in Eq. 2 is decomposed into two codes, h(xi) = (h1(xi), h2(xi)), where \(h_{1}(\mathbf{x}_{i}) = \left(Q(\mathbf{w}_{1}^{T}(\mathbf{x}_{i} - \bar{\mathbf{x}})), \cdots, Q(\mathbf{w}_{l}^{T}(\mathbf{x}_{i} - \bar{\mathbf{x}}))\right)\), \(h_{2}(\mathbf{x}_{i}) = \left(Q(\mathbf{w}_{l+1}^{T}(\mathbf{x}_{i} - \bar{\mathbf{x}})), \cdots, Q(\mathbf{w}_{k}^{T}(\mathbf{x}_{i} - \bar{\mathbf{x}}))\right)\), and l ≤ k, for any Gist vector xi. The first code, h1(xi), is the hash key, corresponding to the l most significant bits of the binary code; the second code, h2(xi), is the residual binary code, which is stored for the finer range measure. This can be interpreted as dividing the Gist vector space by the hyperplanes orthogonal to the principal axes of PCA (playing the role of k-means clusters) and then further quantizing each cluster with the residual binary bits. The approach is closely related to the bag-of-features image descriptor with Hamming embedding [7]: the hash key and the residual bits correspond to the bag-of-features quantizer and the Hamming embedding, respectively, but the bag-of-features quantizer is computationally expensive because it requires an exhaustive distance comparison of the Gist descriptors.
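A hedged sketch of the indexing side of this decomposition, reusing the pca_hash encoder sketched in Section 2.2 and modeling the hash table as a dictionary from the integer value of the l-bit key to a list of (image id, residual bits) entries, is given below (our naming, not the production system):

    from collections import defaultdict
    import numpy as np

    def split_code(code, l):
        """Split a k-bit binary code into an l-bit hash key (as an integer) and the residual bits."""
        key = int("".join(map(str, code[:l])), 2)   # h1: the l most significant PCA bits
        return key, code[l:]                        # h2: the remaining (k - l) residual bits

    def build_index(codes, l):
        """Build the hash table: integer key -> list of (image id, residual code)."""
        table = defaultdict(list)
        for i, code in enumerate(codes):
            key, residual = split_code(code, l)
            table[key].append((i, residual))
        return table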

To build the indexing structure, the l-bit hash key constructs a hash table of size \(2^{l}\), and the residual codes are stored in the corresponding hash buckets as linked lists. Given a query xq, its binary code h(xq) is decomposed into h1(xq) and h2(xq). Using the hash key h1(xq), the buckets containing relevant images are selected when they lie within the range dkey: all hash keys within the Hamming ball of size dkey are selected, and the selected buckets contain the candidate near duplicate images, \(\mathcal{N}_{h_{1}}(\mathbf{x}_{q}) = \left\{ \mathbf{x}_{i} \mid d_{H}\left(h_{1}(\mathbf{x}_{q}), h_{1}(\mathbf{x}_{i})\right) < d_{key} \right\}\). The buckets can be accessed directly through the address offsets obtained by XORing h1(xq) with the precomputed Hamming ball. The items in each bucket are then measured by the Hamming distance between the full-length binary codes, \(d_{H}\left(h(\mathbf{x}_{q}), h(\mathbf{x}_{i})\right)\). This distance can also be computed as \(d_{H}\left(h_{1}(\mathbf{x}_{q}), h_{1}(\mathbf{x}_{i})\right) + d_{H}\left(h_{2}(\mathbf{x}_{q}), h_{2}(\mathbf{x}_{i})\right)\), where the first term is identical for all items in the same bucket. The final items are the intersection of those within the range distance d on the full-length code and those stored in buckets whose key values are within the range distance dkey of the query vector. The near duplicate image set is represented by \(\mathcal{N}_{h}(\mathbf{x}_{q}) = \left\{ \mathbf{x}_{i} \mid d_{H}\left(h_{1}(\mathbf{x}_{q}), h_{1}(\mathbf{x}_{i})\right) < d_{key} \right\} \cap \left\{ \mathbf{x}_{i} \mid d_{H}\left(h(\mathbf{x}_{q}), h(\mathbf{x}_{i})\right) < d \right\}\), over all database images Ii.
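The corresponding lookup can be sketched as follows, reusing split_code and the table from the previous sketch: keys within the Hamming ball of radius dkey around the query key are enumerated, the matching buckets are probed, and a candidate is kept only if its full-length Hamming distance is below d.

    from itertools import combinations
    import numpy as np

    def keys_within_ball(key, l, d_key):
        """Enumerate all l-bit keys whose Hamming distance from `key` is less than d_key."""
        for r in range(d_key):                      # radii 0, 1, ..., d_key - 1
            for bits in combinations(range(l), r):
                flipped = key
                for b in bits:
                    flipped ^= (1 << b)             # flip bit b
                yield flipped

    def query_index(code_q, table, l, d_key, d):
        """Return ids of database images whose full-code Hamming distance to the query is below d."""
        key_q, res_q = split_code(code_q, l)
        results = []
        for key in keys_within_ball(key_q, l, d_key):
            key_dist = bin(key ^ key_q).count("1")  # identical for every item in this bucket
            for img_id, res_i in table.get(key, []):
                if key_dist + np.count_nonzero(res_i != res_q) < d:
                    results.append(img_id)
        return results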

In this method, three important parameters must be determined: the hash key size l and the ranges dkey and d (or dres = d − dkey). The hash key size l determines the size of the hash table and also limits the maximum recall of the near duplicates. As the key size grows, i.e., as the hash table size increases, the achievable maximum recall increases; larger values give better recall but require more memory and more buckets to be searched. The distance range dkey in the hash table determines the Hamming ball size achieving the maximum recall, or the best recall attainable with a small Hamming ball. The last parameter, d, is the range threshold that determines the maximum detection rate for the false alarm rate allowed by the system.

4 Multi-block Image Description

Although the Gist feature is effective for near-duplicate image representation, it is limited as a global feature: it encodes the holistic region well, but local image regions and local details are not explicitly encoded. Therefore, performance degrades under strong variations such as cropping and border framing. We address this problem with a multi-block approach, which encodes a cropped region as well as the holistic region of the original image.

We use multiple query images instead of a single query image. The multiple query images include the original query image and its cropped versions. Although other image variations could also be used, the cropping variation is the most effective in our near-duplicate image detection scenario. We call this the multi-block image description because it closely resembles the multi-block approach used in face recognition [10].

Incorporating the encoding of the cropped region plays several roles in near duplicate image detection. First, since the cropped image region not only corresponds to one instance of the cropping variations of a query image but also removes border framing relative to the database images, it can handle some of the typical and error-prone variations of near duplicate images. The cropped image region acts as a new query, a kind of query expansion, which has proven very effective in bag-of-features methods based on local features. Second, a hybrid feature built from multiple image regions effectively increases the discriminability of the representation, because the cropped region yields a slightly different perceptual set of Gist features.

The multi-block approach produces multiple binary codes for each image, so they must be combined. Two alternative ways of computing the distance are possible. First, the multiple codes can be concatenated into a single binary code, but this increases the code length and is hard to combine with the hash-based large-scale search described above. The second approach keeps multiple binary codes per image, compares all possible pairs, and takes the minimum distance. This is very natural in the context of hash-table-based search: given a query image with multiple binary codes, the candidate near-duplicate images are retrieved for each code, and without directly computing the minimum distance over all possible pairs, we declare an image a near duplicate if any one of its binary codes falls within the range threshold of one of the query codes.
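A minimal sketch of this pairwise combination is shown below; it assumes a Gist extraction function gist_fn and reuses the pca_hash and query_index sketches from earlier sections, and the crop ratio and helper names are illustrative only:

    import numpy as np

    def center_crop(image, ratio=0.1):
        """Crop `ratio` of the height/width from each border of an (H, W, C) image array."""
        h, w = image.shape[:2]
        dh, dw = int(h * ratio), int(w * ratio)
        return image[dh:h - dh, dw:w - dw]

    def multi_block_query(image, gist_fn, mean, W, table, l, d_key, d, crop_ratio=0.1):
        """Query with the holistic image and its center crop; union the retrieved image ids."""
        detected = set()
        for block in (image, center_crop(image, crop_ratio)):
            code = pca_hash(gist_fn(block), mean, W)        # Gist extraction + PCA hashing
            detected.update(query_index(code, table, l, d_key, d))
        return detected                                     # near duplicates under the minimum-distance rule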

5 Experimental Results

We experiment on publicly available datasets, including Copydays [7], Peekaboom [1], Mirflickr [9], and Tiny Images [20]. Copydays is mostly used to test the accuracy of the compared methods, while Mirflickr and Tiny Images act as distractors for scalability tests. The Peekaboom dataset is used for training the transform matrix and tuning the parameters: half of the set is used for training and the other half for parameter tuning and evaluation. For accuracy evaluation, we measure the mean average precision (mAP), computed from the precision/recall curve, which measures how many relevant images are retrieved across the ranks [7]. Additionally, the detection rate (true positive rate) and false acceptance rate (false positive rate) are used to evaluate the performance as a detector or filter.

The near duplicate variations are tested in two ways. The variations provided with the Copydays dataset are used first; they include image cropping, parameterized by surface ratio, and JPEG compression with image scale reduction. In addition, further variations are generated to simulate the diverse near duplicate images observed in real image collections, as shown in Fig. 1.

5.1 More Comparison Study

We compare the proposed method (Gist+PCA+hashing) with the following methods: the Gist descriptor (Gist) and its PCA projection (Gist+PCA), and the hashing methods Gist+PCA+VBA, Gist+PCA+ITQ, and Gist+LSH, which correspond to optimal bit allocation [3], iterative quantization [8], and locality sensitive hashing by random projection, respectively. They are also compared at different dimensions and code sizes: 64 and 128. The real-valued descriptors (Gist, Gist+PCA) serve as benchmarks for the hashing techniques and can be considered an upper bound on their performance.

The top row of Fig. 4 shows comparison results for cropping and JPEG compression on the Copydays dataset in terms of mAP over 157 queries. The proposed method with 128 bits performs best in both cases, followed by the optimal bit allocation (Gist+PCA+VBA). Interestingly, our 64-bit binary codes are comparable to the other hashing techniques (Gist+PCA+ITQ, Gist+LSH) with 128-bit codes. The bottom row of Fig. 4 presents ROC curves showing the detection and false acceptance rates for the proposed code (Gist+PCA+hashing) and the iterative quantization code (Gist+PCA+ITQ). When a false acceptance rate of 0.05 (5%) is allowed by the system, the proposed method achieves an almost 90% detection rate up to 20% cropping, while the iterative quantization method remains below 50%.

Figure 4. Comparison study. (Top row) Mean average precision in terms of cropping and JPEG compression. (Bottom row) The ROC curves of the proposed code (left) and the iterative quantization code (right).

5.2 Decomposition and Parameter Tuning

To decompose the full-length binary code, we need to find the hash key length and the Hamming ball size that achieve the maximum recall, i.e., l and dkey. We parameterize the hash key length with 16, 32, 64, and 128 bits for the 128-bit full-length binary code, and for each length the detection rate is computed with respect to the Hamming ball size, i.e., the distance threshold. We set the parameters on the 10% cropped images (centercrop_10), i.e., 19% of the surface cropped, which is a strong attack. As shown in Fig. 5 (left), for 16, 32, 64, and 128 bits, the maximum detection rate reaches approximately 77%, 95%, 97%, and 99% at Hamming ball sizes of 4, 9, 21, and 59, respectively. We therefore set l = 16 and dkey = 4; in this setting, only 77% of the near duplicate images can be passed on for the further full-length comparison, which results in an inevitable accuracy degradation for the sake of real-time detection. The final range threshold d is determined in the training stage once the allowed false acceptance rate is given; typically it ranges from 26 to 29 at 5% false acceptance.
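The sweep itself can be sketched as follows, assuming the l-bit hash-key bits of training (original, near-duplicate) pairs are available; the names are ours:

    import numpy as np

    def detection_rate_vs_ball(keys_orig, keys_dup, max_radius):
        """Fraction of near-duplicate pairs whose l-bit hash keys fall inside each Hamming ball.

        keys_orig : (P, l) hash-key bits of the original training images
        keys_dup  : (P, l) hash-key bits of their near-duplicate variations
        """
        d_pairs = np.count_nonzero(keys_orig != keys_dup, axis=1)
        return [float(np.mean(d_pairs <= r)) for r in range(max_radius + 1)]

    # Example: inspect the resulting curve for each candidate key length and pick the
    # smallest ball size at which the detection rate saturates.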

Figure 5. The hash table parameters and the decomposition verification. (Left) Detection rate with respect to Hamming ball size for hash keys of various lengths. (Right) Mean average precision with respect to the near duplicate variations for the hashing methods before and after the code decomposition. The decomposed code is denoted "Gist+PCA+hashTable" in the following plots because it uses the hash table for speedup.

The accuracy degradation from the decomposition is shown in Fig. 5. In this experiment, 25,000 images from Mirflickr are used as distractors and Copydays is used as the test set. The degradation is negligible up to centercrop 10%, with 0.95 mAP, and the mean average precision is still acceptable for leftcrop 10% and centercrop 20%, with 0.94 and 0.86, respectively. Even for centercrop 20%, the degradation is less than 10%.

5.3 Effectiveness of the Multi-block Coding

The effectiveness of the multi-block encoding is shown in Fig. 6. When the full-length code is considered, the multi-block approach is slightly better than plain PCA hashing, as expected, in some cases: border_w10 and border_b10. For the decomposed codes, the improvement is large and the mean average precision remains high: for the border variations, the gain is a factor of more than 2.1 and up to 3.1 in mean average precision. This shows that the multi-block approach is very effective for large-scale systems with strong variations.

Figure 6. Multi-block coding. (Left) The full-length code. (Right) The decomposed code using the hash table. The number after "mb" is the cropping rate for the multi-block region; mb0.0 means the multi-block scheme is not applied. The suffixes "con" and "pw" indicate that the codes are compared by concatenating the vectors or by pairwise comparison between multiple blocks, respectively.

5.4 Scalability to Large-Scale Images

Finally, the scalability of the method is tested by increasing the number of distractors: 0 (no distractors), 25,000 (Mirflickr 25K), 1,000,000 (Mirflickr 1M), and 10,000,000 (a 10M random sample from Tiny Images). Figure 7 shows the results for both 64-bit and 128-bit binary codes. For the 64-bit code, the mAP decreases very quickly and only centercrop 10% is handled well. For the 128-bit code, however, the mAP remains good for the challenging variations, such as centercrop 20%, leftcrop 10%, and watermark_s2a5, even under 10 million distractors.

Figure 7. Scalability test. (Left) 64-bit code. (Right) 128-bit code. Note that the multi-block approach is not applied in these experiments.

To measure speed and memory usage at the larger scale, 29,000 queries from the Peekaboom dataset with the centercrop 10% variation were tested under 10 million distractors, and the results were averaged for a stable measurement. The RAM usage was 600 MB, and detection took 0.2 seconds on average on a single thread of a 2.40 GHz Intel Xeon CPU, with a detection accuracy of 0.89 in mean average precision. When the Hamming ball size of the hash table is reduced from four to three for the sake of speed, the search time drops to 0.07 seconds, and the accuracy degradation is small: 0.85 in mean average precision.

Finally, the algorithm is evaluated on a video dataset with manual ground truth. We crawled the thumbnails of 73,100 videos (i.e., 438,600 images, since each video has six thumbnails) using the 673 most popular keywords, and the near duplicate images were carefully verified by a human expert within each keyword. Two different videos are declared duplicates if more than three of their thumbnails are within a certain range. With the false acceptance rate set at 1.1%, the detection rate of the algorithm is 78.4% with the distance range d = 26.

5.5 Computation Time

As shown in Table 2, PCAH is the most efficient method for training except for LSH, which does not require any training. Supervised hashing methods such as MLH [18] and KSH [16] involve an optimization process for training and require much more time than the other compared methods. Note that PCAH, ITQ, MLH, and LSH spend the same time for testing, because they simply multiply by a transform matrix of the same dimension (\(\mathbf{W}^{T} \in \mathbb{R}^{d \times d^{\prime}}\)). However, the test times of KSH and SH are much larger, because KSH must compute the distances between the anchors and the test data, and SH uses multiple thresholds for a single dimension. All experiments were implemented in MATLAB and performed on a desktop with an Intel Core i7 CPU at 3.30 GHz and 16 GB RAM.

Table 2 Training and testing time of each method on the CIFAR100 dataset [15] + 4k distractors.

6 Concluding Remarks

We proposed a scalable near duplicate image detection method that decomposes the binary code of Gist-PCA hashing and extends the holistic image region to multiple blocks. Comparison studies and experimental results verified its effectiveness in terms of accuracy, speed, and memory usage. In the future, we will seek a more compact code using sophisticated machine learning techniques and will also extend this work to the similar image search problem.