1 Introduction

Social media intelligence consists in collecting public media content in order to understand trends or to drive communication and business strategy. It requires massive data processing, which must often be fast as well, so that one can react quickly to a particular event. Tools that process textual content have existed for a while, but the advent of the social web has led to a rise of visual content (image/video) propagation on websites or on users’ online social network (OSN) profiles, either verbatim or with minor modifications. From a media intelligence perspective, it is very interesting to detect massive content propagation, even of slightly tampered content, since it usually relates to the same “social event”. The latter refers here to an event in real life: when a celebrity does something, for instance, the (almost) same picture will be published many times. Such a phenomenon was also observed in the early years of the Web with the diffusion of visual memes [37].

Near-duplicate image detection is of high interest to several multimedia applications and has been thoroughly studied for checking unauthorized use of protected visual content. Our focus here is nevertheless on another application, namely visual media intelligence, which consists in detecting similar visual content published on the (social) web within a relatively short period of time. We argue that, under certain conditions, near-duplicate image detection is a good model for detecting the massive propagation of visual content that could feed a social media intelligence platform.

Technically, regardless of its use, near-duplicate copy detection can be cast as a content-based retrieval task. An important characteristic to be taken into account in media intelligence is that the processed content comes as a stream of visual data that must be processed continuously. To do this, the copy detection domain usually focuses on the speed of the search process and on the robustness to different image transformations. Consequently, most state-of-the-art approaches rely on visual signatures that are built from local features, later aggregated into a small vector in order to speed up the search process. However, local feature computation and aggregation have a non-negligible cost, and indexing is usually performed offline. In the media intelligence scenario, the computation time of the visual signatures has to keep pace with the reception of new input data. More precisely, the “indexing+search” operation must be executed at a higher rate than that of new data collection. For instance, if a system ingests half a million visual multimedia items per day, their comparison with recent content (assumed to include 10 to 100 million documents, already indexed during the previous months from daily inputs and reduced to significant and interesting ones only) must be performed in less than \(\frac {24 \times 3600}{500000}=172.8\) milliseconds, that is to say around 6 images per second. Such a processing rate requirement makes signatures based on transforms and compression of local features difficult to use if computational resources are constrained. In practice, since the computational capacity directly impacts the cost of the media intelligence platform, it is interesting to minimize the resources needed for this particular processing.
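As a back-of-the-envelope check, the following snippet (plain Python, using the daily volume assumed above) reproduces this per-item time budget and the corresponding minimum processing rate.

```python
# Per-item time budget for the "indexing+search" operation, assuming a daily
# ingestion volume of half a million items, as in the scenario above.
ITEMS_PER_DAY = 500_000
SECONDS_PER_DAY = 24 * 3600

budget_ms = 1000.0 * SECONDS_PER_DAY / ITEMS_PER_DAY  # 172.8 ms per item
min_rate = ITEMS_PER_DAY / SECONDS_PER_DAY            # about 5.8 items per second

print(f"budget per item: {budget_ms:.1f} ms, minimum rate: {min_rate:.1f} items/s")
```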

Hence, to support such a processing rate, one must question the real need for near-duplicate detection in the context of visual media intelligence. The authors of [34] performed a user study showing that the most frequent image transformations are usually quite simple. Frequent transformations include: format conversion, rescaling, changing the image aspect ratio, small crops or rotations, image or text embedding, (JPEG) compression, color transforms (“filters”), and possibly left-right flips. From a practical usage perspective, copy detection methods that aim at detecting re-posts of similar content should focus on these usual transformations in order to maximize their effectiveness.

The objective of this paper is thus to address the problem of fast indexing and searching of visual content in a media intelligence scenario. To achieve this objective, we propose a new signature which has the following properties: (i) fast to compute (≤ 5 ms on a single core of an Intel Xeon processor @ 2.10 GHz); (ii) very compact (< 100 bytes), thus enabling fast search of a large database (\(10^{7}\) to \(10^{8}\) images), even exhaustively; (iii) suitable for an inverse index representation to speed up the search; (iv) robust to frequent Web image transformations. Our approach consists in extracting a compact hash signature, based on a particular pattern that makes it quasi-invariant to resizing and left-right flips. We refine its design in order to make it robust to other usual image transformations on the Web, such as blur, rotation, crop, image compression, as well as text and image incrustation.

This paper builds upon our prior conference publication [10] and includes the following additions: 1) we propose a new scheme that enhances the signature without changing its memory footprint (thus preserving its efficiency at search time) while improving its robustness to ’crop’ transformations; 2) we compare the proposed pattern to alternative ones, thus exhibiting its most important aspects; 3) we conduct complementary experiments, including a run on a corpus containing more than 100 million images.

We present recent and older works related to our problem in Section 2. The proposed method is then presented in Section 3, including the design of our “core” hash signature, its enhancement to make it more robust, and the associated method for efficient search. In Section 4 we report a set of experiments that demonstrates the efficiency of the proposed approach, in comparison to alternative methods and former state-of-the-art works. Section 5 highlights the main points of this paper and draws some perspectives for this work.

2 Related work

Watermarking-based approaches rely on a “mark” inserted into a document. The digital watermark must be perceptually invisible to a human and robust to the transforms the document may be subjected to. In addition to robustness, watermarking requires performing the insertion from the beginning, a constraint which is unrealistic if the publisher does not control the initial publication of the content. A more flexible solution to copy detection is to find near-duplicate documents directly from their content. This process is an instantiation of content-based image retrieval (CBIR), which first extracts a vectorial representation of the visual content and then exploits it to find similar images in a test database. The choice of the image representation is usually made to ensure a good compromise between result accuracy and speed, two characteristics which are often contradictory.

From the first systems dedicated to copy detection [4] up to the mid-2000s, many works were based on the use of global descriptors [21], in particular to detect repeated “clips” in a video stream [6, 23]. Although fast to compute, these methods usually suffered from a poor robustness to severe image transforms. They were thus progressively replaced by local feature based approaches [3, 8, 19]. These consist in detecting a region around points of interest and then describing them with robust descriptors such as SIFT [22], SURF [2] or robust local binary patterns [13]. Local features are quantized according to a pre-defined codebook and then indexed into an inverse-file structure, which corresponds to a forward index described by bag-of-features (BoF) with hard coding [31], and can thus be retrieved efficiently. Further improvements have been proposed to compress or enhance this representation, making the retrieval process faster and more precise. In image classification, it has been shown that finer coding strategies can lead to better BoF representations [15]. However, for retrieval, few of these strategies have been tested. The improvement proposed in [16] better takes into account the local feature quantization and adds an efficient way to consider the spatial arrangement of interest points in each image. For very large-scale databases, local descriptors are usually aggregated into a unique vector that describes the deviation of a given image with respect to an average representation of the visual world. This (VLAD or Fisher Kernel) vector is then compressed using PQ-codes [17] in order to be able to efficiently search a large database while maintaining good precision [18].

This compression of large signatures is in the vein of works aiming at representing images with compact signatures, that is to say a hash, that reflects the semantics of the image. Hashing-based image representations were also used in the domain of image authentication and forensics, under the name robust image hashing [32]. Regarding content-based image retrieval, semantic hashing [27] consists in learning a compact binary signature that preserves the class-based information. The original work [26, 28] used an autoencoder to learn such codes. Later, Weiss et al. proposed to compact a GIST descriptor [24] using spectral hashing [36]. A kind of combination of these last works was proposed simultaneously [35], consisting in learning a binary code from a GIST descriptor using several approaches, including Restricted Boltzmann Machines [14]. In [17], Jegou et al. proposed to quantize several parts of a large signature such that it is compressed into a small number of integers. They also defined a distance to perform an exhaustive search in this space, showing that it is equivalent to an approximate search in the original space.

Hence, local feature based methods exhibit good performance, and efficient indexing schemes have been proposed to exploit them for fast image search. However, these works usually report only the search time, and the proposed methods are still too slow to compute in an “online streaming data” scenario for which the feature extraction time must be considered.

3 Proposed approach

We propose a descriptor and a search process that allow the detection of near-duplicate images under the media intelligence constraints related to processing time, memory size and copy detection accuracy of actual Web content. The core of the method is to extract a hash from a resized version of the image. Its construction is mostly based on pixel value comparisons and is therefore cheap to compute. Moreover, such a binary signature allows us to use the Hamming distance during the search, which is faster to compute than, for example, the Euclidean distance. The full descriptor is an enhancement of this hash signature, composed of additional information. A particular version of this enhancement consists in computing the same hash on the image in polar coordinates, to be more robust to rotation and scale transformations. In the following paragraphs, we describe the descriptor extraction process, common to the original image and its polar coordinate counterpart. The overall process of our method is illustrated in Fig. 1.

Fig. 1

Pipeline of the process to compute the proposed signature. It is applied to an image (top) and its transform into polar coordinates (bottom). See Section 3.2 for details

3.1 Rescaling

The first step of our method consists in converting the image to grayscale and reducing its size to H × W pixels. The image can be rescaled with any interpolation technique, but taking the average of neighbouring pixels seems sufficient. This step ensures a quasi-invariance with respect to resampling transformations (with or without keeping the original aspect ratio). It also gives a good robustness with respect to small details of the image such as small watermarks or text incrustations. The conversion to grayscale combined with the extreme resizing leads in practice to a good robustness to color transformations. To make the descriptor invariant to flips, it should be built upon symmetric comparisons; therefore W must be an even number: W = 2w, w ∈ ℕ. One can also resize the image with an odd W and ignore the central column in the following.
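As an illustration, this rescaling step can be written as follows. This is a minimal sketch assuming Pillow and NumPy are available; any library offering grayscale conversion and an area-averaging (box) resize would do.

```python
# Sketch of the rescaling step (Section 3.1): grayscale conversion followed by
# a reduction to H x W pixels using area (average) resampling.
import numpy as np
from PIL import Image

H, W = 16, 16  # W must be even (W = 2w) to allow symmetric comparisons

def rescale(path: str) -> np.ndarray:
    """Return the reduced H x W grayscale image I_r as a uint8 array."""
    img = Image.open(path).convert("L")      # grayscale conversion
    img = img.resize((W, H), Image.BOX)      # averaging of neighbouring pixels
    return np.asarray(img, dtype=np.uint8)
```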

3.2 Hash extraction

Let us consider the rescaled image \(I_{r}\) of size H × W. Each of its lines, indexed by i, is a set of spatially ordered pixels \(\mathcal {P}=(p_{1}, p_{2}, \dots , p_{2w})\), with \(p_{n}\) being on the left of \(p_{m}\) when n < m. Let us consider two subsets \((\mathcal {P}_{x},\mathcal {P}_{y})\) such that \(\mathcal {P}_{x} \cap \mathcal {P}_{y}=\emptyset \) and:

$$ p_{j}\in \mathcal{P}_{x} \Rightarrow p_{2w+1-j}\in \mathcal{P}_{y} $$
(1)

There are \(2^{w}\) such couples of subsets, corresponding to the possible choices of \(\mathcal {P}_{x}\) (and thus of \(\mathcal {P}_{y}\)). We select J of these couples and define the following aggregating values on each of them:

$$ {x_{i}^{j}} = \sum\limits_{p_{k} \in \mathcal{P}_{x}^{j}}{I_{r}(p_{k})} \,\,\text{and}\,\, {y_{i}^{j}} = \sum\limits_{p_{k} \in \mathcal{P}_{y}^{j}}{I_{r}(p_{k})}. $$
(2)

We consider the comparison function s defined by:

$$ s(x,y)=1-\mathcal{H}(y-x)=\left\{ \begin{array}{l l} 1 & \quad \text{if}\,\, x > y\\ 0 & \quad \text{otherwise} \end{array}\right., $$
(3)

where \(\mathcal {H}(.)\) is the Heaviside step function. We compute the hash value for line i by:

$$ \lambda(i) = \left( s({x_{i}^{1}}, {y_{i}^{1}}), \dots, s({x_{i}^{J}}, {y_{i}^{J}})\right). $$
(4)

Finally, the resulting hash for the full image I is

$$ h(I)=\left[\lambda(1), \lambda(2),\dots,\lambda(H)\right]. $$
(5)

The hash of image I, denoted h(I), thus has a size of J × H bits. When J is a multiple of 8, each value λ(i) can be efficiently encoded as a J-bit integer.

The definition of the aggregating functions (Equation (2)) ensures a quasi-invariance to flip transforms. However, there exist \(H \cdot 2^{W/2}\) possibilities to compute h(.), since there are \(2^{w}\) possible couples \((\mathcal {P}_{x},\mathcal {P}_{y})\) and H lines. The question of the choice of the subsets is investigated in Section 4. In particular, we show that good results are obtained with H = W = J = 16 and the subsets given in Table 3. This choice is compared to random choices of subsets, which nevertheless still yield good results as long as they contain the simplest subsets with one pixel.
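To make the construction concrete, the sketch below implements Equations (1) to (5) in Python. The couples of subsets used here are an illustrative configuration, not the exact content of Table 3: the w single-pixel symmetric comparisons plus w two-pixel groups taken from the left half of each line and mirrored on the right half.

```python
# Sketch of the hash extraction of Section 3.2, applied to the reduced image
# I_r produced by the rescaling sketch above (H = W = 16, so w = 8, J = 16).
import numpy as np

H, W = 16, 16
w = W // 2

def mirror(indices):
    """Mirror of pixel indices: p_j <-> p_{2w+1-j} (0-based: k <-> 2w-1-k)."""
    return [2 * w - 1 - k for k in indices]

# P_x taken from the left half of the line, P_y as its mirrored counterpart,
# so that P_x and P_y are always disjoint (illustrative choice, not Table 3).
LEFT = [[j] for j in range(w)] + [[j, (j + 1) % w] for j in range(w)]
COUPLES = [(p_x, mirror(p_x)) for p_x in LEFT]
J = len(COUPLES)  # 16 couples, i.e. 16 bits per line

def line_hash(line: np.ndarray) -> int:
    """lambda(i): one bit per couple, set when the sum over P_x exceeds P_y."""
    bits = 0
    for j, (p_x, p_y) in enumerate(COUPLES):
        x = int(line[p_x].sum())        # x_i^j, Eq. (2)
        y = int(line[p_y].sum())        # y_i^j, Eq. (2)
        bits |= int(x > y) << j         # s(x, y), Eq. (3)
    return bits                         # lambda(i) packed as a J-bit integer

def image_hash(I_r: np.ndarray) -> list[int]:
    """h(I): concatenation of lambda(1), ..., lambda(H), Eq. (5)."""
    return [line_hash(I_r[i]) for i in range(I_r.shape[0])]
```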

3.3 Descriptor enhancement

In addition to the hash, we add information by keeping track of the mean intensity of the image, m(I), and of \(\text {eq}(I) = {\sum }_{i=1}^{H}{\sum }_{j=1}^{J} \delta _{{x_{i}^{j}}, {y_{i}^{j}}}\), where \(\delta_{a,b}\) is the Kronecker delta, so that eq(I) counts the number of comparisons which lead to an equality. These values are used to assess whether one image is a mirrored duplicate of another. We also noted in our experiments that using them globally improves the performance. The values m(I) and eq(I) are each coded on a single byte (8 bits).

Another enhancement consists in computing the previously described components both on the original image and on its transform in polar coordinates. Such a process improves the robustness to rotations and crops since these transforms induce a translation of the pixel values. The induced change in the reduced image (of size H × W) can easily be mitigated by an appropriate choice of the subsets \((\mathcal {P}_{x},\mathcal {P}_{y})\) in the hash design, for instance by including several consecutive pixels in the subsets. However, this only holds if the center used to compute the polar coordinates is approximately the same as that of the transform. In practice, using the spatial center of the image is often a good choice. An alternative is to determine it automatically according to the actual content of the image, for instance with the method given in [20] to obtain an invariant centroid, which consists in choosing the expectation of the luminance levels on each coordinate. It nevertheless did not improve our experimental results in practice, thus we only use the spatial center of the image to compute the polar coordinates in the following.

Let I be the original image and θ(I) the image in polar coordinates. The final descriptor is the concatenation of all the computed and encoded values:

$$ \phi(I) = \left[h(I), \text{m}(I), \text{eq}(I), h\left( \theta(I)\right), \text{m}\left( \theta(I)\right), \text{eq}\left( \theta(I)\right)\right] $$
(6)

It has a size of 2 × (J × H + 2 × 8) bits. Figure 2 shows the binary representation of the description of one image using a setting which gives a 68-byte signature.
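A possible implementation of the full descriptor is sketched below. It reuses image_hash() and COUPLES from the previous sketch; the polar resampling (nearest-neighbour, centered on the spatial center) and the one-byte encoding of m(.) and eq(.) are simplified, assumed choices rather than the exact implementation.

```python
# Sketch of phi(I), Eq. (6): hash, mean intensity and equality count, computed
# on the reduced image and on its polar-coordinates counterpart.
import numpy as np
from PIL import Image

H, W = 16, 16

def polar(gray: np.ndarray) -> np.ndarray:
    """Resample gray on an H x W (angle, radius) grid centered on the image."""
    h, w = gray.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    thetas = np.linspace(0.0, 2.0 * np.pi, H, endpoint=False)
    radii = np.linspace(0.0, min(cy, cx), W)
    ys = np.rint(cy + np.outer(np.sin(thetas), radii)).astype(int)
    xs = np.rint(cx + np.outer(np.cos(thetas), radii)).astype(int)
    return gray[ys, xs]

def describe(reduced: np.ndarray):
    """Return (h, m, eq) for an H x W reduced image (m and eq on one byte)."""
    m = int(reduced.mean())
    eq = sum(int(reduced[i, p_x].sum() == reduced[i, p_y].sum())
             for i in range(H) for (p_x, p_y) in COUPLES)
    return image_hash(reduced), m, min(eq, 255)

def phi(img: Image.Image):
    """phi(I) = [h(I), m(I), eq(I), h(theta(I)), m(theta(I)), eq(theta(I))]."""
    reduced = np.asarray(img.convert("L").resize((W, H), Image.BOX), np.uint8)
    gray = np.asarray(img.convert("L"), dtype=np.uint8)
    return describe(reduced) + describe(polar(gray))
```

Within this sketch, the variant \(\phi_{T}(.)\) introduced below (Equation (7)) amounts to calling describe(polar(gray).T) instead.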

Fig. 2

The right part represents the binary description ϕ(I) of the image displayed on the left, using the 68-byte descriptor. There are 2 distinct parts corresponding to the original image and to the image in polar coordinates. For each representation, 3 categories are identified with different colors: red represents h(.), blue m(.) and green eq(.). Each black or white square within these categories represents a bit

An alternative is to use \(\theta(I)^{T}\) instead of θ(I), leading to:

$$ \phi_{T}(I) = \left[h(I), \text{m}(I), \text{eq}(I), h\left( \theta(I)^{T}\right), \text{m}\left( \theta(I)^{T}\right), \text{eq}\left( \theta(I)^{T}\right)\right]. $$
(7)

\(\phi_{T}(.)\) has the same size as ϕ(.). Since the hash function acts on the lines of the reduced image of size H × W, the “polar coordinates” part of ϕ(.) improves the robustness to rotations, while that of \(\phi_{T}(.)\) improves the robustness to zooms (crops).

We tried to design a signature able to capture both effects at the same time by considering consecutive lines (in addition to the symmetric columns specified by Equation (1)), but the experimental results were not significantly improved.

3.4 Searching process

Since our descriptor is composed of hashes (i.e. it is a binary signature), we use the Hamming distance \(d_{H}\) to compute the distance between two descriptors. \(d_{H}\) is defined as follows:

$$ \forall a,b\in\{0,1\}^{n}, \text{d}_{H}(a,b) = \#\mathcal{D}, $$
(8)

where \(\mathcal {D} = \{i : a_{i} \neq b_{i}\}\). The Hamming distance thus counts the number of bits which differ between two values. However, in our framework, we do not compute the Hamming distance on m(.) and eq(.): computing the absolute difference on these values allows us to determine whether one image is a flipped duplicate of the other (along the x-axis, the y-axis, or both combined). The distance \(d(\phi(I_{1}), \phi(I_{2}))\) between two images \(I_{1}\) and \(I_{2}\) is

$$ d\left( \phi(I_{1}),\phi(I_{2})\right) = \text{d}_{H}\left( h(I_{1}), h(I_{2})\right) + \text{d}_{H}\left( h\left( \theta(I_{1})\right), h\left( \theta(I_{2})\right)\right) + r, $$
(9)

where \(r=\frac {|\text {eq}(I_{1})-\text {eq}(I_{2})|+|\text {m}(I_{1})-\text {m}(I_{2})|}{2}\). In our implementation we use the POPCNT instruction, available on recent processors, to compute the Hamming distance. The bit counting is done with one processor instruction, whereas naive and Look-Up Table based implementations use several instructions. Consequently, POPCNT is faster than the other implementations we tested.
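A direct transcription of this distance is given below. The descriptors are assumed to be laid out as in the phi() sketch above, i.e. tuples (h, m, eq, h_polar, m_polar, eq_polar) with hashes stored as lists of J-bit integers; int.bit_count() (Python 3.10+) plays the role of the POPCNT instruction, and a C implementation would typically rely on __builtin_popcountll.

```python
# Sketch of the distance of Eq. (9). Only the m(.) and eq(.) values of the
# original image enter the r term, as in the equation above.

def hamming(h1: list[int], h2: list[int]) -> int:
    """Number of differing bits between two hashes stored as J-bit integers."""
    return sum((a ^ b).bit_count() for a, b in zip(h1, h2))

def distance(phi1, phi2) -> float:
    h1, m1, eq1, hp1, _, _ = phi1
    h2, m2, eq2, hp2, _, _ = phi2
    r = (abs(eq1 - eq2) + abs(m1 - m2)) / 2.0
    return hamming(h1, h2) + hamming(hp1, hp2) + r
```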

The searching process can also be dramatically sped up with the use of an inverse-file structure. It consists in quantizing the descriptor according to a discrete codebook that allows efficient indexing and fast search. Such a codebook is usually learned with k-means from a large representative set of vectors in the description space (here, vectors ϕ(.) resulting from our process). At indexing time, each vector \(\phi(I_{i})\) is assigned to an element \(c_{j}\) of the codebook, called a codeword. Then, at search time, if a query image \(I_{t}\) is represented by the vector \(\phi(I_{t})\) and that vector is closer to \(c_{j}\) than to any other codeword, the searching process defined by Equation (9) is applied only to the vectors assigned to \(c_{j}\), not to all the vectors of the reference database. Hence, if the vectors are equally divided among the codewords (and k-means clustering should lead approximately to such a result), the search time is divided by the size of the codebook.
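A minimal sketch of this inverted-file scheme is given below. It is agnostic to how the codebook is learned (standard k-means, or the k-medians variant discussed next); codebook is assumed to be a list of representative descriptors and dist stands for the distance of Equation (9).

```python
# Inverted-file indexing and search: each descriptor is stored in the list of
# its nearest codeword, and a query is only compared with the lists attached
# to its n_probe closest codewords.
from collections import defaultdict

def build_index(codebook, database, dist):
    """database: iterable of (image_id, descriptor) pairs."""
    index = defaultdict(list)
    for image_id, desc in database:
        j = min(range(len(codebook)), key=lambda k: dist(desc, codebook[k]))
        index[j].append((image_id, desc))
    return index

def search(query, codebook, index, dist, n_probe=1):
    """Rank only the entries stored under the n_probe closest codewords."""
    probes = sorted(range(len(codebook)),
                    key=lambda k: dist(query, codebook[k]))[:n_probe]
    candidates = [entry for j in probes for entry in index[j]]
    return sorted(candidates, key=lambda e: dist(query, e[1]))
```

With a single probe (n_probe = 1) this matches the description above; the experiments of Section 4.2 use several probes (m = 200 codewords out of 20,000).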

However, preliminary experiments showed that the usual process based on k-means [7] led to a dramatic drop in the performance of our method. We thus propose to learn the codebook with k-medians instead of k-means, and we show experimentally in Section 4 that we obtain the intended effect, with an 80× speed-up at the price of a very moderate performance drop.
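For binary signatures compared with the Hamming distance, the coordinate-wise median of a cluster is simply the bitwise majority vote, which arguably makes k-medians a natural fit for such signatures. The sketch below illustrates this, assuming the binary part of each signature has been unpacked into a 0/1 vector; it is a simplified illustration of the idea, not the exact training procedure used in our experiments.

```python
# Sketch of k-medians on binary vectors: Hamming-distance assignment and
# bitwise majority-vote update (the median of 0/1 values).
import numpy as np

def k_medians(bits: np.ndarray, k: int, n_iter: int = 20, seed: int = 0):
    """bits: (n, d) array of 0/1 values. Returns a (k, d) binary codebook."""
    rng = np.random.default_rng(seed)
    centers = bits[rng.choice(len(bits), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment: nearest center under the Hamming distance
        # (naive dense computation, acceptable for a sketch).
        dists = (bits[:, None, :] != centers[None, :, :]).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Update: bitwise majority vote within each cluster.
        for j in range(k):
            members = bits[assign == j]
            if len(members):
                centers[j] = (members.mean(axis=0) >= 0.5).astype(bits.dtype)
    return centers
```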

4 Experiments

As mentioned above, we focus on streamed visual data in the context of a social media intelligence scenario, in which all new images ingested by the system need to be compared with a dataset made of recent images. Experiments are carried out with two different benchmarks. The Image Copy Detection dataset is a benchmark that was released recently and allows us to compare our method to the state of the art. We also propose our own benchmark (WebTransforms), which includes other transformations we consider important, such as small rotations, flips and partial blur (e.g. on a human face), and which has been extended to a larger scale (14 to 100 million distractors). All experiments have been conducted on the same computer equipped with an Intel Xeon processor @ 2.10 GHz, 32 GB of RAM and running Linux.

4.1 Datasets and evaluation protocol

WebTransforms We propose a dataset in which thirteen transformations (see Table 1) have been applied to the 5,011 training images of the PascalVOC’07 dataset [9], resulting in 70,154 images. We also merged our dataset with 2,000,000, 14,158,566, 85,308,719 and 100,158,012 distractor images extracted from Flickr to see how the method behaves with relatively large-scale databases. A visual representation of the transformations is available in Fig. 3. We evaluate the search performance on this dataset with the recall@14, i.e. the fraction of relevant images retrieved in the top 14 positions, averaged over all the queries. At rank 14, this measure is equal to precision@14 as well. We consider this measure more relevant than the classical mean average precision when a large number of distractors is present, since we are mainly interested in the relevance of the first retrieved documents. In other words, the recall@14 criterion reflects the performance of a nearest neighbor (NN) classifier that would take its decision based on the first 14 retrieved documents. Note that the center of transforms 4, 7 and 8 is the same as that used to compute the polar coordinates in Equations (6) and (7), that is to say the spatial center of the image. However, the center of transform 6 (random crop 80 %) is usually different from that used to compute the polar coordinates.
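For clarity, the recall@14 measure can be computed as in the sketch below, where rankings and relevant are assumed data structures holding, for each query, the ranked list of retrieved image ids and the set of relevant ids.

```python
# Sketch of the recall@k measure used on WebTransforms (k = 14): fraction of
# the relevant images retrieved in the top k positions, averaged over queries.

def recall_at_k(rankings: dict, relevant: dict, k: int = 14) -> float:
    scores = []
    for q, ranked in rankings.items():
        hits = sum(1 for doc in ranked[:k] if doc in relevant[q])
        scores.append(hits / len(relevant[q]))
    return sum(scores) / len(scores)
```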

Table 1 Transformations used in the WebTransforms dataset
Fig. 3

The first image starting from the left is the original image and the following 13 are detailed in Table 1

Image Copy Detection dataset

The authors of [34] introduced a dataset to evaluate image copy detection methods for searching duplicate images on the web. This dataset is composed of 6,000 query pictures taken at various locations in the world. Sixty transformations (Table 2) have been applied to these images, leading to a total of 360,000 images. The transformations were chosen after a survey of 45 persons familiar with image processing, who were asked to report the most common transformations they encountered when looking at images with their favorite search engine. Following the evaluation protocol from [34], we merged the near-duplicate images with a collection of 2,000,000 Flickr images. We also compared our method with the GIST descriptor [24], TOP-SURF [33] and Convolutional Neural Network (CNN) intermediate features (layer fc7) extracted from a 16-layer network [30] trained on ImageNet [25]. The search quality is measured by the mean average precision (mAP) computed over the 6,000 queries for each transform.

Table 2 Transforms used in the copy detection dataset

4.2 Results

WebTransforms

The evaluation is conducted with five settings of our descriptor: (I) H = W = J = 8 (20-byte signature); (II) H = W = J = 16 (68-byte signature); (III) H = W = J = 16 with J random couples of subsets of u to v pixels per subset; (IV) H = W = J = 16 where we pick the 8 couples of subsets containing one pixel and complete them with random couples of 2 pixels; and (V) H = W = J = 16 where we use \(\phi_{T}(.)\) to extract the descriptor. The J subsets used in setting (II) are enumerated in Table 3; the first eight couples are those used for setting (I). In practice, for settings (III) and (IV), we sample the random couples of subsets with a simple procedure (see the sketch below). First, for each couple \((\mathcal {P}_{x}, \mathcal {P}_{y})\), we randomly draw, in the interval [u;v], the number n of pixel positions to choose (setting (IV) fixes n = 2 for its random couples). For each position, we pick a random pixel, \(p_{j}\), to construct \(\mathcal {P}_{x}\) and put its symmetric counterpart, \(p_{2w+1-j}\), in \(\mathcal {P}_{y}\). The J subsets used in setting (III) with u = 1 and v = 2 are enumerated in Table 4.

Table 3 Chosen configuration of subsets \((\mathcal {P}_{x},\mathcal {P}_{y})\) with H = W = J = 16 to conduct the experiments
Table 4 Configuration of subsets \((\mathcal {P}_{x},\mathcal {P}_{y})\) of setting (III) with u = 1 and v = 2
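The sampling procedure used for settings (III) and (IV) can be sketched as follows; the way disjointness is enforced is a simplified assumption on our part.

```python
# Sketch of the random sampling of couples (P_x, P_y) used in settings (III)
# and (IV): each couple receives between u and v pixel positions, each drawn
# on one side of the line and mirrored on the other.
import random

def sample_couples(J=16, w=8, u=1, v=2, seed=0):
    rng = random.Random(seed)
    couples = []
    for _ in range(J):
        n = rng.randint(u, v)               # number of pixel positions
        p_x, p_y = [], []
        for _ in range(n):
            j = rng.randrange(2 * w)        # 0-based position
            while j in p_x or j in p_y:     # keep P_x and P_y disjoint
                j = rng.randrange(2 * w)
            p_x.append(j)
            p_y.append(2 * w - 1 - j)       # symmetric counterpart
        couples.append((p_x, p_y))
    return couples
```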

We compare our method to three descriptors: TOP-SURF [33]; GIST [24], a global descriptor popular for web-scale image indexing [7] and one of the best methods for content-based duplicate image detection [34]; and deep CNN features, which have demonstrated astounding results in different computer vision tasks [11, 29]. In the case of CNN features, due to their high dimensionality (4096 dimensions), we reduce them to 512 dimensions via PCA. Experiments show that even with fewer dimensions, this kind of feature still performs almost as well as the uncompressed one [1, 5]. We additionally binarize the CNN features with iterative quantization (ITQ) [12], an unsupervised method which finds a rotation matrix that minimizes the quantization error of mapping the features to a binary representation. PCA and ITQ are computed on an independent dataset of 10,000 images. We chose to use binary codes in order to have a better comparison with our method in terms of number of bits per feature, as our signature uses 544 bits or less. The results are reported in Table 5. The 68-byte descriptor [line (b)] outperforms the 20-byte one [line (a)] by +5.3 %. It is also slightly better than GIST [line (k)] by +0.8 %, but worse than TOP-SURF [line (l)] by -0.9 % and than CNN features [line (m)] by -0.6 %. Most of the performance loss is due to severe crops (transf. 7 and 8).

Table 5 Results on the WebTransform dataset

Our descriptor reacts quite well when merged with a large-scale image collection [lines (n), (r), (s) and (t)], since the performance drops by only 4 points with 2M distractors, giving a performance similar to GIST [line (o)]. Once again, this is mainly due to severe crops, although one can observe a significant drop for rotation as well. However, according to [34], rotations are not often encountered on the web. With 14 million, 85 million and 100 million distractors, the performance of our descriptor drops by only 2, 4.5 and 4.8 additional points, respectively. We can also observe that TOP-SURF does not react well when merged with distractors [line (p)]: its performance drops by 9 points, which is similar to our descriptor with 85 million distractors. CNN features react quite well to distractors but the performance gap (-6 %) is still higher than with our descriptor. We can also note that CNN features are quite robust to crops, even severe ones, and invariant to flip and resize. This can be explained by the training procedure: multiple crops are extracted per training image at multiple scales, and subsequently randomly flipped horizontally. However, CNN features lack robustness to image incrustation and text incrustation: this might be caused by the objective function used to train the network. In fact, the network is trained to recognize 1,000 classes of objects. Incrusting images, text or watermarks in an image may change the final decision of the network and lead to intermediate representations that are far apart from those of the unmodified counterpart.

However, the main advantage of our approach is that it is much faster, as shown in Table 6. For instance, GIST and TOP-SURF require respectively 316 ms and 120 ms per query (averaged over 5,011 queries) to search the 70,154-image database, while our approach needs 4 ms. On the 200 times larger database (14M images), our method requires 845 ms with a single core and scales almost linearly, requiring 53 ms (496 ms for 100M images) with 16 cores. The binarized CNN features require approximately the same amount of time (3 ms) to search the database, but the computation time associated with the description is very high on a single CPU core (> 1 second).

Table 6 Results on the WebTransform dataset with computation time

For our method, the extraction time is quasi-independent of the actual image. Indeed, the only “image-dependent” step is the conversion to grayscale and the resizing. Once the input is reduced to a fixed H × W grayscale image and the J subsets are fixed, the extraction time is independent of the input. In the same vein, the search process is an exhaustive search on the full signature and thus only depends on its size and on that of the database. Hence, once the reference database is fixed and the signature design chosen, the search time is the same for any query.

We conducted several supplementary experiments to find a better choice of subsets than the particular arbitrary split of setting (II). The results are reported in lines (c) to (h) of Table 5. We can clearly see that the denser our description, i.e. the smaller the number of pixels per subset, the better the performance. For instance, with random subsets comprising 6 to 8 pixels [line (h)], the recall@14 is 15.2 % behind a setting where random subsets have 1 to 2 pixels [line (e)]. We can also note that setting (IV) [line (i)] is slightly better (+1 %) than (e). We thus believe that building our descriptor with all unitary subsets is the most important aspect of our method to achieve the strongest possible robustness. We could not find any combination of random couples able to outperform our initial intuitive choice.

The results of setting (III) in lines (c) to (h) are lower than those of the other settings for transformations 4 (rotation) and 11 (sepia). It is particularly the case when u > 1, that is to say when the setting does not use unitary subsets. For instance, when these unitary subsets are removed, the performance on the sepia transformation drops from 51.8 % [line (c)] to 27.7 % [line (d)], while it remains almost the same at 51.2 % [line (f)] when the largest subsets are removed. On the contrary, when only the smallest subsets are kept [line (e)], the results are very good (97 %). This is probably due to the larger change in Equation (2) (average luminance) when the pixel subset is larger. A similar rationale applies to the polar-coordinates part of the signature in the case of the rotation transform. For these two transformations in particular, the importance of the unitary subsets is even more apparent.

We also include the evaluation of setting (V) [line (j)], which gives better results than the other methods. The improvement over setting (II) (+1 %) is mostly due to a better robustness to moderate crops (transform 7) and image incrustation (transform 9). The robustness to transform 7 is in line with expectations, since (V) was specifically designed to be more robust to crops. The robustness to transform 9 can be interpreted in the same vein. Indeed, the incrustation of Lena hides 25 % of the center of the image and the similarity with copies holds on the periphery. Hence, since \(\phi_{T}(.)\) captures the information at a given distance from the image center, it makes this setting particularly robust to this transform.

Image Copy Detection dataset

For this experiment we compare two settings, (II) and (V), to state-of-the-art methods. We include both speed and accuracy measures in the evaluation. Therefore we measure, for each method, the description and matching times, the precision, and the average rank per duplicate for each transformation. Description and matching times and the mean average precision of each method are reported in Table 7. Our descriptor is at least one order of magnitude faster than the other methods.

Table 7 Results on the Image Copy Detection dataset, compared to state-of-the-art methods

Our method with setting (II) obtains a mAP of 99.1 % and setting (V) reaches a mAP of 99.3 % over all 60 transformations, whereas GIST, TOP-SURF and CNN features are behind with a mAP of 93.2 %, 93.7 % and 97.3 % respectively. To give further insight into the obtained results, we present the average rank per duplicate in Fig. 4 and the precision-recall curve in Fig. 5. With our method, the duplicate image is, on average, ranked at the first position for a wide majority of transformations. Setting (V) works slightly better than setting (II): for several transformations the duplicate image is ranked closer with setting (V). However, for some transformations the mean rank is high, but this is mostly due to a few images ranked very far. Figure 6 shows several examples of ranked images when the -50 % brightness, +100 % saturate and big menu incrustation transforms are applied. Overall, the obtained results show that our descriptor outperforms TOP-SURF, GIST and CNN features for this task, both in terms of accuracy and scalability.

Fig. 4

Average rank per duplicate (the lower, the better) for each of the transformations of the Image Copy Detection dataset. Note the logarithmic scale for y-axis. The x-axis refers to the transformation number of the dataset

Fig. 5

Precision-recall curve of all methods for all transformations of the Image Copy Detection dataset combined

Fig. 6

The query image is displayed on the left; the transformed images (-50 % brightness, +100 % saturate, big menu incrustation) are displayed on the right for each image. The ranks of the duplicate images returned by our descriptor are displayed at the bottom right of each image. The last two queries are dark; when -50 % brightness is applied, the transformed images are almost completely black, therefore our method fails to capture their structure

We also implemented an inverse indexing structure using a vocabulary of 20,000 codewords learned with k-medians, as explained in Section 3.4, in order to achieve a significant search time speedup while maintaining a good accuracy. In this scheme, a query x is associated with its m nearest codewords \((q_{1}(x), q_{2}(x), \dots, q_{m}(x))\). For each codeword q(x), we compute the distance between the query vector x and each base vector \(y_{i}\) within its list. We found that m = 200 was a satisfying trade-off between search speed and performance in our experiments. This approach allows each query to be compared, on average, with only 1.5 % of the database, and thus reduces the search time from 120 ms to 1.5 ms on the same hardware while keeping a mAP of 96.7 %. The \(\frac {120}{1.5}=80\) speedup is consistent with the theoretical value of \(\frac {20000}{200}=100\). The difference (1.5 − 1.2 = 0.3 ms) is due to the search of the m closest centroids to the query.

5 Conclusions

We introduced a fast and robust image description which is well suited for indexing and searching through image data streams. Its compact size (< 100 bytes) and the use of an efficient Hamming distance computation allow us to extract a descriptor for one image and exhaustively search a large image collection (≈ 100M images) for duplicates in half a second on a 16-core processor. We also showed that our method can support an efficient inverse index structure, leading to a huge speed-up (80×) while keeping a better accuracy than the recent state of the art.

We conducted several experiments to show the efficiency of our method, demonstrating that it performs better than state-of-the-art methods, both in precision and, above all, in efficiency. Given a reference base of 2,360,000 images, our method supports a rate of 150 independent queries per second on a single core (with the inverse indexing structure speedup). With such performance, one can henceforth consider detecting duplicate images in a real system processing a stream of web images.

We also proposed several alternative designs of our descriptor and studied their consequences. Among them, we proposed a “best mode” which gives the best overall robustness without altering the size or the extraction process. However, despite its good performance, the particular choice we propose as “best mode” may not be optimal. Finding such an optimal pattern to design the core of our signature will be the subject of future work.