1 Introduction

On the Internet, copying and forwarding images is very easy. This not only provides convenience for users [1] but also brings numerous security problems [2, 3]. For example, in March 2013, a media outlet reported that transgenic corn products supplied by the Monsanto Company might cause cancer, citing partial CCTV screenshots to bolster its credibility. The report was widely reprinted and caused mass panic. However, it was later confirmed that the report was an old news story from September 2012 whose conclusions had already been rejected by France's High Council of Biotechnology. Recently, using old or irrelevant images on microblogs to spread rumors has become increasingly rampant. Hence, tracking and verifying the sources of images has become essential.

In the process of tracking and verifying image sources, the most critical problem is large-scale duplicate image detection. Strictly speaking, “duplicate” is a precise term: the data content of a duplicate is exactly the same as that of the original. For practical multimedia applications, however, “duplicate” is focused more on external form than on data content, so the definition also covers different copies of an image. These copies come from a set of tolerable transforms. By analyzing microblog images, we find that there are three main tolerable transformation types:

  1.

    Scaling transformations: An image is scaled or stretched. This situation is very common in microblog images.

  2.

    Watermark transformations: The same image has different watermark information added. For example, when users upload images to Sina Weibo or Sohu Weibo, the microblog ISPs automatically add watermark information to images.

  3.

    Storage format transformations: An image is stored in different formats, such as JPEG, PNG, or TIFF. The changes in image content are small.

More complex transformations, such as illumination changes, rotations, and other optical or geometric transformations, occur with very small probability in microblog images and hence are not considered in this article.

The existing literature has provided a number of duplicate image detection methods. Although these methods can reasonably handle the scenarios they were designed for, on very large image collections they suffer from either intensive computational complexity or degraded performance [4], so they have difficulty satisfying the real-time responsiveness and accuracy required by large-scale image retrieval. For example, Changick [5] used an ordinal measure of discrete cosine transformation (DCT) coefficients to detect duplicate images. The method exhibited good resistance to noise attacks and to horizontal or vertical flip attacks. Li et al. [6] later proposed a robust image copy detection algorithm based on an ordinal measure of the full DCT coefficients. Because a scaling transformation changes the middle- and low-frequency coefficients of the full DCT in a regular way, their method could resist scaling attacks. However, a single feature tends to miss detailed information when describing images, so the universality of these methods is limited. To detect more duplicate images, the retrieval threshold then has to be relaxed, which reduces retrieval accuracy. For large image collections, even a small error rate yields a large absolute number of erroneous results, which degrades the user experience. Meanwhile, existing algorithms are mainly designed for single-machine environments and are difficult to extend directly to distributed environments. Therefore, for large-scale microblog images, this study proposes a real-time duplicate image detection method based on multi-feature fusion. This method has the following advantages:

  1.

    By multi-feature fusion, this method uses spatial structure features, color features, and texture features from multiple angles to filter non-duplicate images. This can effectively improve retrieval accuracy.

  2.

    Through its Hbase design, this method uses a bloom filter and a range query to ensure the real-time performance of hadoop clusters. The bloom filter is used to eliminate part of the extended signatures, which reduces disk IO. The range query replaces the prefix query, which effectively reduces both the amount of data loaded from Hbase and the number of comparisons in the next filtering stage.

2 Related work

At present, according to different features, content-based copy detection (CBCD) can be divided into three research directions:

2.1 Global feature-based CBCD

Global features extract global statistical information to represent images. They can be divided into color features, shape features, texture features, and spatial structure features. The advantages of global features are their simple calculation and low space requirements. However, because they focus on the overall information of an image, global features tend to ignore local details.

The first near-duplicate image search engine was RIME [7]. RIME used Daubechies' wavelet transformation to extract image features and a multi-dimensional extensible hashing scheme to create a high-dimensional index. This method could effectively detect slightly modified images, but it had difficulty identifying severely distorted ones. Subsequently, Wu et al. [8] proposed an ellipse track-division-based image copy detection method. However, because the ellipse shape lacks the rotation invariance property, this method did not really solve the rotation distortion problem. On this basis, Zhou et al. [9] proposed a cirque division strategy to detect duplicate images. Since the content of a cirque track region is almost invariant under rotation and scale attacks, the proposed method could effectively resist both. However, it could not solve the circle shift issue caused by non-proportional scaling transformations. In addition, Hsiao et al. [10] used an extended feature set to detect image copies. The method estimated all possible transformations in advance from prior knowledge and then searched the transformed images one by one. In practice, however, it is not only difficult to estimate all transformation types, but the extended feature sets also incur additional computational costs, which affects the real-time performance of the system. In recent years, researchers have gradually realized that a single feature can hardly ensure retrieval accuracy. Feng et al. [11] therefore proposed a novel image descriptor, the global correlation descriptor (GCD), which extracts color and texture features separately so that both contribute equally to CBCD. To address large-scale image search, Jegou et al. [12] proposed aggregating local image descriptors into a global vector; their experiments show that the image representation can be reduced to a few dozen bytes while preserving high accuracy.

2.2 Local feature-based CBCD

Local feature-based CBCD first detects the stable regions of an image and then extracts high-dimensional feature vectors in the vicinity of each region, so that the image is represented as a set of feature vectors. Compared with global features, local features are better suited to handling local changes in image content. However, local features have difficulty distinguishing similar images from duplicate images. Meanwhile, the matching algorithms for local features are more complex and can hardly meet the real-time requirement of large-scale image retrieval.

The landmark work on local features was the scale-invariant feature transform (SIFT) feature-based method [13]. SIFT has good local invariance, and it received widespread attention in academia. On this basis, many scholars proposed improvements to the SIFT feature-based approach. To improve discrimination and robustness, Mikolajczyk and Schmid [14] proposed the gradient location orientation histogram (GLOH) feature. To improve illumination invariance, many scholars began to study color SIFT, extending SIFT features from the grayscale space to the color space [15]. Although the above methods could solve their respective problems, they incurred substantial time overhead. To reduce computational complexity, Ke and Sukthankar [16] proposed PCA-SIFT features. However, experiments showed that the discrimination of PCA-SIFT decreases with decreasing feature dimensions. Bay et al. [17] also proposed an accelerated algorithm for SIFT features, the speeded-up robust feature (SURF). Experiments showed the extraction rate of the SURF feature to be three times faster than that of the SIFT feature [18]. Although the above algorithms had good scale and rotation invariance, they had difficulty resisting affine transformations. Given this situation, based on the “watershed” concept, Matas et al. [19] proposed the maximally stable extremal regions (MSER) detection algorithm. In addition, the fast and compact binary robust independent elementary feature (BRIEF) gained more and more attention [20]. The advantage of BRIEF was its computation speed, but BRIEF lacked the rotation invariance and scale invariance properties. Therefore, ORB (oriented FAST and rotated BRIEF) [21] was proposed by Rublee et al. Although ORB also lacked the scale invariance property, in practice the problem could be solved by heuristic strategies. Experimental results showed the calculation speed of ORB to be 100 times faster than that of SIFT and 10 times faster than that of SURF, making ORB suitable for processing real-time video data. Combining characteristics of the human retina with the Daisy descriptor [22], Alahi et al. [23] proposed the fast retina keypoint (FREAK), whose sampling pattern follows the density distribution of the human retina. Yang and Cheng [24] proposed a highly efficient and distinctive binary descriptor, the local difference binary (LDB). LDB directly computes a binary string for an image patch using simple intensity and gradient difference tests on pairwise grid cells within the patch. A multiple-gridding strategy and a salient bit selection method are applied to capture the distinct patterns of the patch at different spatial granularities. Compared with other descriptors, LDB has higher robustness and computational efficiency.

2.3 Perception hash-based CBCD

Computational complexity is the key problem in large-scale image retrieval. A perception hash can address it by compressing image features into binary signatures. Binary signatures generated by a perception hash have two advantages: They preserve the similarity of the original feature space, and they greatly improve computational efficiency. However, because they contain less image information, binary signatures have difficulty ensuring retrieval precision.

At present, perception hashes can be divided into three categories: unsupervised hashes, supervised hashes, and semi-supervised hashes. Unsupervised hashes are perception hashes whose generation process does not need training data. The typical representatives of unsupervised hashes are LSH [25], minhash [26], and simhash [27]. However, a large number of experimental results show that similarity in the measurement space does not guarantee semantic similarity. In real life, image collections often contain a large amount of tagged data; to take full advantage of these existing semantic data, supervised hashes were proposed. Compared with unsupervised hashes, supervised hashes have higher retrieval precision, so in recent years they have developed considerably. The typical representatives of supervised hashes are similarity-sensitive coding (SSC) [28], restricted Boltzmann machines (RBMs) [29], self-taught hashes (STH) [30], nonnegative sparse coding-induced hashes (NSCIH) [31], and histograms of sparse codes (HSC) [32]. Although supervised hashes can preserve the semantic similarity of images, when the amount of tagged data is too small, over-fitting occurs. To solve this problem, Wang et al. [33] proposed a semi-supervised learning hash (SSH), and Bauml et al. [34] proposed a unified learning framework for multi-class classification that incorporates both labeled and unlabeled data.

3 Algorithm description

Duplicate image detection means finding images that have the same visual perception but different encodings. At present, due to the semantic gap, the accuracy of similarity matching cannot reach 100 %. In large-scale image retrieval, even a small error rate yields a large absolute number of erroneous results, which significantly affects the user experience. Therefore, in order to improve retrieval accuracy, this section selects a perception hash, a block-average grayscale feature, and a Haar wavelet feature to implement multi-feature fusion. These features effectively compensate for each other. The fusion process is divided into two stages: rapid filtration and precise filtration.

3.1 Rapid filtration stage

A large-scale image retrieval system faces stringent real-time and scalability requirements. Therefore, a rapid feature comparison method is needed to meet them.

3.1.1 Random projection as perception hash

In this stage, this study uses a random projection algorithm as a perception hash to achieve real-time responsiveness and scalability. The random projection algorithm generates binary signatures for images, and comparing the signatures judges the similarity between two images.

A random projection algorithm preserves data similarity: similar data in the high-dimensional space have a smaller Hamming distance in the Hamming space, and dissimilar data have a larger one. The computation of binary signatures is simple and fast, and the signatures are easy to compare and extend. The drawback, however, is a lack of accuracy.

The specific algorithm proceeds as follows (a code sketch is given after step 3):

  1.

    Image feature extraction

    Firstly, the input image is converted into a grayscale image $K$, which is scaled to $M \times M$. Then, $K$ is evenly divided into $n$ blocks. The average grayscale value $v_i$ of each block $K_i$ is computed, and the block-average grayscale feature $V_K = (v_1, v_2, \ldots, v_n)$ is generated. If the gray value of the image pixel at $(x, y)$ is $g(x, y)$, then formula (1) is established.

    $$ v_{i} = \frac{n}{M^{2}} \sum_{x,y \in K_{i}} g(x,y) $$
    (1)
  2.

    Image signature generation

Definition 1

Random projection hash $h_X(V)$: a nonzero vector $X = (x_1, x_2, \ldots, x_n)$ is randomly selected from the $n$-dimensional space, where each component $x_i$ is drawn from the standard normal distribution $N(0, 1)$. Consider the angle between the feature vector $V$ and the vector $X$: if it is acute, then $h_X(V) = 1$; otherwise, $h_X(V) = 0$. If the inner product of vectors is denoted by $\circ$, then formula (2) is established as follows:

$$ h_{X}(V) = \begin{cases} 1, & V \circ X \ge 0 \\ 0, & V \circ X < 0 \end{cases} $$
(2)

Definition 2

Random projection signature $\text{sig}(V)$: vectors $X_1, X_2, \ldots, X_f$ are randomly generated, where $f \le n$. Then, the inner products between $V$ and each $X_i$ are calculated. If $V \circ X_i \ge 0$, then the $i$th hash bit is $h_{X_i}(V) = 1$; otherwise, $h_{X_i}(V) = 0$, namely

$$ \text{sig}(V) = h_{X_{f}}(V)\, h_{X_{f-1}}(V) \cdots h_{X_{1}}(V) $$
(3)

Theorem 1

In a random projection signature, 0 and 1 appear with equal probability, and the signature bits are independent [35].

Proof

Suppose that $V$ is a vector in the $n$-dimensional real space; then we have the following conclusions.

Independence

If $S_i = V \circ X_i$, $X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,n})$, and $\forall x_{i,j} \sim N(0, 1)$, then $S_i = \sum_{j=1}^{n} v_j x_{i,j}$ and $v_j x_{i,j} \sim N(0, v_j^2)$. According to the additivity of normal distributions, the random variable $S_i \sim N(0, \sum_{j=1}^{n} v_j^2) = N(0, \delta^2)$. This indicates that each component generated by the random projection algorithm follows a normal distribution with expectation 0 and variance $\sum_{j=1}^{n} v_j^2$.

For any two variables $S_p$ and $S_q$ ($p \ne q$), the covariance is $\text{cov}(S_p, S_q) = E(S_p S_q) - E(S_p) E(S_q)$. Since $S_p$ and $S_q$ follow normal distributions with expectation 0, $\text{cov}(S_p, S_q) = E(S_p S_q) = E\left(\left(\sum_{i=1}^{n} v_i x_{p,i}\right)\left(\sum_{j=1}^{n} v_j x_{q,j}\right)\right)$. According to the distributive law of multiplication, $\text{cov}(S_p, S_q) = E\left(\sum_{i=1}^{n} \sum_{j=1}^{n} v_i v_j x_{p,i} x_{q,j}\right) = \sum_{i=1}^{n} \sum_{j=1}^{n} v_i v_j E(x_{p,i} x_{q,j})$. Because $x_{p,i}$ and $x_{q,j}$ follow the standard normal distribution and are independent, $E(x_{p,i} x_{q,j}) = 0$; therefore, $\text{cov}(S_p, S_q) = 0$.

Equality

By Definition 1, we obtain $h_X(V) = \text{sign}(S) = \text{sign}(V \circ X) = \begin{cases} 1, & V \circ X \ge 0 \\ 0, & V \circ X < 0 \end{cases}$, where $\text{sign}(\cdot)$ denotes the sign function. Combined with the independence conclusion $S_i \sim N(0, \sum_{j=1}^{n} v_j^2) = N(0, \delta^2)$, whose distribution is symmetric about 0, we can infer formulas (4) and (5). Namely, 0 and 1 appear with equal probability in a random projection signature.

$$ \Pr(h_{X}(V) = 1) = \Pr(\text{sign}(S) = 1) = \Pr(S \ge 0) = \frac{1}{2} $$
(4)
$$ \Pr(h_{X}(V) = 0) = \Pr(\text{sign}(S) = 0) = \Pr(S < 0) = \frac{1}{2} $$
(5)

Theorem 2

In a random projection signature, the probability that two $f$-bit signatures are exactly equal is $\Pr(s) = \left(1 - \frac{\arccos(s)}{\pi}\right)^{f}$, where $s$ represents the similarity of the two vectors.

Proof

Suppose that $U$ and $V$ are two $n$-dimensional vectors. As shown in Fig. 1, if the angle between $U$ and $V$ is $\theta$, then $h_{X_i}(U) \ne h_{X_i}(V)$ only when the hyperplane normal to the random vector $X_i$ falls between $U$ and $V$. The probability that a single signature bit differs is therefore $\frac{\theta}{\pi}$. Combined with Theorem 1, the probability that an $f$-bit signature is exactly equal is $\Pr(s) = (1 - \frac{\theta}{\pi})^{f}$. According to the cosine similarity principle $s = \frac{U \circ V}{|U||V|} = \cos \theta$, we can infer $\Pr(s) = (1 - \frac{\arccos(s)}{\pi})^{f}$.

Fig. 1

A random projection

As shown in Theorem 2, $\Pr(s)$ is a monotonically increasing function of $s$. This property preserves data similarity: similar data in the high-dimensional space have a smaller Hamming distance in the Hamming space. Therefore, the higher the similarity of two vectors, the higher the probability that their hash values are equal.
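As a quick sanity check of Theorem 2, the following Python sketch empirically estimates the probability that two vectors receive identical signatures and compares it with $(1 - \frac{\arccos(s)}{\pi})^{f}$. The test vectors, the signature length $f = 16$, and the trial count are illustrative choices, not values from this paper.

```python
import numpy as np

def empirical_match_prob(u, v, f=16, trials=20000, seed=0):
    """Fraction of trials in which all f random projection bits of u and v agree."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((f, len(u)))        # f random Gaussian vectors
        if np.array_equal(X @ u >= 0, X @ v >= 0):  # identical f-bit signatures?
            hits += 1
    return hits / trials

u = np.array([1.0, 0.0, 0.0])
v = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)        # angle pi/4, so s = cos(pi/4)
s = float(u @ v)
predicted = (1.0 - np.arccos(s) / np.pi) ** 16      # Theorem 2
print(f"predicted {predicted:.4f}, empirical {empirical_match_prob(u, v):.4f}")
```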

  3.

    Image signature comparison

    Image signatures generated by random projection may be affected by noise. Therefore, in order to improve retrieval recall, this paper considers not only the situation where image signatures are exactly equal, but also the situation where they are similar. Namely, when the Hamming distance between two image signatures is not greater than the threshold $H$, we consider that the two images may be duplicates, as in formula (6).

    $$ D_{\text{ham}}(\text{sig}(U), \text{sig}(V)) = \sum_{i=1}^{f} \left( h_{X_{i}}(U) \oplus h_{X_{i}}(V) \right) \le H $$
    (6)
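The following Python sketch ties steps 1–3 together: block-average grayscale feature extraction (formula (1)), random projection signature generation (formulas (2) and (3)), and Hamming distance comparison (formula (6)). The image scale $M = 256$, the block count $n = 64$, the signature length $f = 32$, and the fixed seed are illustrative assumptions; decoding and resizing the input image to an $M \times M$ grayscale array are assumed to happen elsewhere.

```python
import numpy as np

M, N_BLOCKS, F_BITS = 256, 64, 32  # assumed image scale, block count, signature length

# One random Gaussian vector per signature bit (Definition 2); the fixed seed
# ensures every image is hashed against the same projection vectors.
PROJECTIONS = np.random.default_rng(42).standard_normal((F_BITS, N_BLOCKS))

def block_average_feature(gray: np.ndarray) -> np.ndarray:
    """Formula (1): mean grayscale of each of the n blocks of an M x M image."""
    b = int(np.sqrt(N_BLOCKS))                     # blocks per side (8 x 8 grid here)
    blocks = gray[:M, :M].reshape(b, M // b, b, M // b)
    return blocks.mean(axis=(1, 3)).ravel()

def signature(feature: np.ndarray) -> int:
    """Formulas (2)-(3): one bit per random projection, packed into an integer."""
    bits = (PROJECTIONS @ feature >= 0).astype(np.uint8)
    return int("".join(map(str, bits)), 2)

def hamming(sig_u: int, sig_v: int) -> int:
    """Formula (6): number of differing signature bits."""
    return bin(sig_u ^ sig_v).count("1")
```

Two images are passed on to the precise filtration stage only if `hamming(sig_u, sig_v) <= H`.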

3.2 Precise filtration stage

From the standpoint of information content, the original image contains the richest content, but that information is hidden in the overall structure of the image and cannot be directly understood by computers. Therefore, features must be extracted to describe images. However, every feature extraction process incurs a certain degree of information loss, which makes it difficult for retrieval accuracy to meet user needs; the information loss of a perception hash is particularly pronounced. Therefore, a more sophisticated exact match is indispensable. In this stage, this study uses a block-average grayscale feature and a Haar wavelet feature to conduct multiple rounds of filtering. Filtering from the angles of both color and texture can effectively improve retrieval accuracy.

3.2.1 Block-average grayscale feature

Due to hash collisions, non-duplicate images may fall into the same bucket of a perception hash; namely, they may have the same signature. Therefore, perception hashes can only serve for fast indexing and preliminary filtering, and non-duplicate images need to be filtered further. Analysis reveals that, for duplicate images, the corresponding components of their block-average grayscale features are very close. On this basis, we use the Manhattan distance to calculate the similarity of two images.

$$ D_{\text{man}}(V_{X}, V_{Y}) = \sum_{i=1}^{N} \left| V_{X_{i}} - V_{Y_{i}} \right| $$
(7)

Here $V_X$ and $V_Y$ represent the block-average grayscale features of images $X$ and $Y$, and $V_{X_i}$ represents the $i$th component of $V_X$. If $D_{\text{man}}(V_X, V_Y)$ is not greater than the threshold $T$, then we consider that images $X$ and $Y$ may be duplicates.
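A minimal sketch of this filter, reusing block_average_feature from the rapid filtration sketch; the default $T = 9$ anticipates the threshold selected experimentally in Sect. 5.1.2.

```python
import numpy as np

def manhattan_distance(v_x: np.ndarray, v_y: np.ndarray) -> float:
    """Formula (7): Manhattan distance between two block-average grayscale features."""
    return float(np.abs(np.asarray(v_x) - np.asarray(v_y)).sum())

def is_candidate_pair(v_x: np.ndarray, v_y: np.ndarray, T: float = 9.0) -> bool:
    """Images X and Y remain duplicate candidates only if D_man(V_X, V_Y) <= T."""
    return manhattan_distance(v_x, v_y) <= T
```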

3.2.2 Haar wavelet feature

The block-average grayscale feature reuses the intermediate results of the perception hash, so its processing speed is very fast. However, its processing units are blocks, which ignore a great deal of detailed information. Therefore, this paper uses a Haar wavelet feature for further filtering. An important characteristic of the Haar wavelet is that it reflects not only the global information of images but also their local information.

The specific algorithm is as follows (a code sketch follows the list):

  1.

    The input image is scaled to 64 × 64 pixels and converted to grayscale.

  2.

    Haar wavelet decomposition is applied to the grayscale image, and the 60 elements with the largest absolute values are extracted from the coefficient matrix.

  3.

    The 60 elements are replaced by their one-dimensional subscripts ($V[i, j] \to i \times 64 + j$). Meanwhile, if an element is less than 0, the corresponding subscript is multiplied by −1. The subscripts are then sorted, and the leftmost 10 elements of the sorted sequence are selected as the Haar wavelet feature.

  4.

    When the Manhattan distance between two feature vectors is not greater than the threshold $t$, we consider the images to be duplicates.
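The sketch below implements the four steps in plain NumPy. Since the paper does not specify the decomposition depth or a particular library, a full recursive 2D Haar transform is assumed; the helper names are hypothetical.

```python
import numpy as np

def haar2d(gray64: np.ndarray) -> np.ndarray:
    """Full 2D Haar wavelet decomposition of a 64 x 64 grayscale array (step 2)."""
    a = gray64.astype(float)
    n = a.shape[0]
    while n > 1:
        half = n // 2
        for axis in (0, 1):                          # transform rows, then columns
            part = np.swapaxes(a[:n, :n], 0, axis)
            avg = (part[0:n:2] + part[1:n:2]) / 2.0
            dif = (part[0:n:2] - part[1:n:2]) / 2.0
            part[:half] = avg                        # approximation coefficients
            part[half:n] = dif                       # detail coefficients
        n = half
    return a

def haar_feature(gray64: np.ndarray, keep: int = 60, length: int = 10) -> np.ndarray:
    """Steps 2-3: signed one-dimensional subscripts of the largest coefficients."""
    flat = haar2d(gray64).ravel()                    # V[i, j] -> i * 64 + j
    top = np.argsort(-np.abs(flat))[:keep]           # 60 largest absolute values
    signed = np.where(flat[top] < 0, -top, top)      # negative element -> negated subscript
    return np.sort(signed)[:length]                  # leftmost 10 of the sorted subscripts

def haar_match(f1: np.ndarray, f2: np.ndarray, t: int = 8) -> bool:
    """Step 4: duplicates if the Manhattan distance of the features is within t."""
    return int(np.abs(f1 - f2).sum()) <= t
```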

3.3 Algorithm analysis

3.3.1 Image grayscale information analysis

As shown in Fig. 2, image transformations can change image pixels, which affects the matching of similar images. However, analysis shows that the pixel changes caused by the above transformations are small. In particular, for the block-average grayscale algorithm, calculating the average grayscale value of a region effectively offsets the effects of individual pixel changes and thus reduces the effect of random noise.

Fig. 2

Analysis of grayscale information of sample image

In addition, the change in block-average grayscale values before and after a transformation is small. Through random projection, images with similar space–color structures are mapped to the same bucket, and images with dissimilar structures are mapped to different buckets. Therefore, this hash algorithm can effectively filter out most non-duplicate images and retain only a small set of candidate images, which greatly improves algorithm efficiency. However, hash collisions are difficult to avoid and cause non-duplicate images to be mistakenly identified as duplicates. Therefore, the block-average grayscale algorithm must be used for further precise filtration.

It can be seen that images can be rapidly classified by a perception hash, which effectively reduces the burden of the next phase and improves algorithm efficiency. Meanwhile, building on the previous stage, the block-average grayscale algorithm improves retrieval accuracy through further comparison, and because the block-average grayscale feature reuses the intermediate results of the perception hash, its processing speed is very fast. Therefore, the combination of the two algorithms enjoys complementary advantages.

3.3.2 Image edge information analysis

As shown in Fig. 3, for the above transformations, the image edge information shows almost no change. Therefore, the images can be effectively described by extracting their edge detail information. Haar wavelet transform is currently considered to be an effective edge information extraction algorithm.

Fig. 3

Image edge information analysis

Haar wavelet features and block-average grayscale features complement each other well. On one hand, the Haar wavelet transform extracts image edge information through multi-scale analysis, independent of the original resolution of the image; this characterizes image texture in finer detail, whereas block-average grayscale features lack such a detailed description. On the other hand, the Haar wavelet transform captures image features point by point. In fact, because the boundaries of natural objects are smooth, the main components of images are not points but lines and surfaces, so the Haar wavelet transform shows clear limitations in dealing with images. The block-average grayscale algorithm is exactly the opposite: it deals with surfaces. Hence, block-average grayscale features can effectively compensate for Haar wavelet features.

3.4 Effect analysis

In order to provide a reference for practical applications, this section analyzes the effect of multi-feature fusion.

Figure 4 shows the process of multi-feature fusion. Here images A, B, C, and D are test data, and B is forged so that it has the same block-average grayscale feature as image A. In the perception hash filtration stage, although the four images are different, images A and B share one color structure and images C and D share another. After filtration, the four images are therefore divided into two groups, each enclosed by a red dashed line. Because signature comparison uses a hash map, the time complexity of this stage is O(1). In the block-average grayscale feature filtration stage, although images C and D have the same color structure, the colors of their corresponding blocks differ considerably, so they are easily distinguished by the block-average grayscale feature. Meanwhile, images A and B have the same block-average grayscale feature and cannot be distinguished in this stage. In terms of time, this stage exploits the intermediate results of the previous stage, so it is relatively fast; in the worst case, its time complexity is O(num), where num is the number of images. In the Haar wavelet feature filtration stage, images A and B have different textures, so they are easily distinguished by the Haar wavelet feature; in the worst case, the time complexity of this stage is also O(num). From an overall perspective, due to the good cohesiveness and complementarity among the multiple features, this algorithm can not only maintain a high recall rate, but also meet the accuracy requirements of duplicate image detection. Additionally, as shown in Table 1, the whole time complexity increases linearly with the number of images.

Fig. 4

Multi-feature fusion

Table 1 Time complexity

4 System architecture

In order to ensure the real-time performance of large-scale image retrieval, this paper designs a distributed system architecture based on hadoop. In this design, hadoop clusters provide a distributed parallel processing environment. The hadoop distributed file system (HDFS) is the foundation of the whole structure and supports the superstructure, in which Hbase is mainly used to manage image information and image features. The system architecture is shown in Fig. 5. In the offline part, images are stored in the underlying hadoop clusters; image features are extracted by MapReduce and stored in Hbase as given in Table 2. In the online part, when a query image arrives, the client first extracts the image signature and image norm of the query image in the rapid filtration stage. Then, extended signatures of the query image are generated by changing the image signature within H bits. Comparing the extended signatures with the image signatures in Hbase judges the similarity between two images; here, image signatures serve for fast indexing and preliminary filtering. To improve the retrieval efficiency of Hbase, this paper uses a bloom filter to exclude part of the extended signatures, which reduces the positioning time of the perception hash. Meanwhile, according to formula (11), this paper uses a range query to replace the individual scanning of prefix queries, which effectively reduces the amount of data loaded and the number of comparisons in the next filtering stage. In the precise filtration stage, due to hash collisions, the block-average grayscale feature and the Haar wavelet feature are needed to further filter non-duplicate images.

Fig. 5

System architecture

Table 2 Hbase design

4.1 Hbase design

In the rapid filtration stage, to improve retrieval recall, a test image is treated as a candidate for the next filter stage if the Hamming distance between it and the query image is not greater than the threshold H. For a large number of candidate images, the key issue is how to return query results within a tolerable amount of time. A simple global scan is time-consuming and cannot meet practical needs. Therefore, we need to design and optimize Hbase according to the application. The specific Hbase structure is shown in Table 2. In the primary key, region denotes the partition number of the image signature. The image signature uses hexadecimal representation to improve retrieval speed. The image norm is the 1-norm of the block-average grayscale feature. Count is an image counter used to distinguish rows whose prefix fields would otherwise be identical.

According to the Hbase design, the signatures obtained by changing an image signature within H bits are treated as extended signatures, which can then be searched with prefix queries. Because only partial data are processed rather than a global scan, this method can effectively improve query efficiency. For example, when $H = 2$ and $f = 32$, we need $\binom{32}{0} + \binom{32}{1} + \binom{32}{2} = 529$ prefix queries. It can be seen that the number of lookups increases dramatically as $H$ and $f$ increase.
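The following sketch enumerates the extended signatures for $f = 32$ and $H = 2$, producing exactly the 529 candidates counted above. The hexadecimal row-key prefix follows the primary key layout of Table 2, but its exact formatting is an assumption for illustration.

```python
from itertools import combinations

def extended_signatures(sig: int, f: int = 32, H: int = 2) -> list:
    """All signatures within Hamming distance H of sig.
    For f = 32 and H = 2, this yields C(32,0) + C(32,1) + C(32,2) = 529 candidates."""
    out = [sig]
    for k in range(1, H + 1):
        for positions in combinations(range(f), k):
            flipped = sig
            for b in positions:
                flipped ^= 1 << b        # flip one signature bit
            out.append(flipped)
    return out

def row_prefix(region: int, sig: int, f: int = 32) -> str:
    """Hypothetical row-key prefix: region number plus hexadecimal signature."""
    return f"{region:02d}_{sig:0{f // 4}x}"

assert len(extended_signatures(0)) == 529
```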

4.2 Optimization design

The Hbase prefix query can effectively improve query efficiency. However, in practical applications, image signatures are not uniformly distributed: the same image signature may correspond to many images or to none. So, as H and f increase, it is difficult for a simple prefix search to guarantee the real-time performance of the system. To solve this problem, this study implements two optimizations. On the one hand, a bloom filter is exploited to eliminate part of the extended signatures, reducing the number of signatures searched. On the other hand, for images having the same signature, a range query replaces the prefix query, which decreases the amount of data loaded from Hbase for the next filtering stage.

4.2.1 Bloom filter

The bloom filter is a data structure used to determine whether an element is within a collection. This technology was first applied to disk access and later became widely used in database queries, intrusion detection, and other applications.

In this section, before each range query, we first check the image signature against the bits of the bloom filter. If every corresponding bit is 1, there may be an image with the same signature, and we need to access Hbase further. Otherwise, there is certainly no image with the same signature, and the query is skipped. By using the bloom filter, we can effectively remove part of the extended signatures and decrease disk IO time.
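A minimal bloom filter sketch for this pre-check; the bit-array size, the number of hash functions, and the use of salted SHA-1 digests to derive bit positions are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter over integer image signatures."""

    def __init__(self, m_bits: int = 1 << 24, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, sig: int):
        # Derive k bit positions from salted digests of the signature.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{sig}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, sig: int) -> None:
        for p in self._positions(sig):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, sig: int) -> bool:
        # False: definitely absent, skip the query. True: Hbase must be consulted.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(sig))
```

Only the extended signatures for which might_contain returns True trigger a range query, e.g. survivors = [s for s in extended_signatures(sig) if bf.might_contain(s)].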

4.2.2 Range query

For any two image features X and Y, when their Manhattan distance is not greater than the threshold T, we consider them similar, as in formula (8).

$$ D_{\text{man}}(X, Y) = \sum_{i=1}^{N} |x_{i} - y_{i}| \le T $$
(8)

Regarding X and Y as points in space, the triangle inequality gives formula (9).

$$ \left| \sum_{i=1}^{N} |x_{i}| - \sum_{i=1}^{N} |y_{i}| \right| \le \sum_{i=1}^{N} |x_{i} - y_{i}| $$
(9)

From formulas (8) and (9), we can deduce formula (10). Formula (10) guarantees that if two images are similar, their 1-norms are relatively close. Therefore, sequential scans can be transformed into range scans.

$$ \sum_{i=1}^{N} |y_{i}| - T \le \sum_{i=1}^{N} |x_{i}| \le \sum_{i=1}^{N} |y_{i}| + T $$
(10)

Although a perception hash can preserve the similarity of the data, a large amount of information is lost in the process of generating image signatures, so a large number of dissimilar images share the same signature in Hbase. A prefix query based on image signatures would therefore return a large number of irrelevant images along with their information, incurring heavy data loading and increasing the number of comparisons in the precise filtration stage. Hence, images with the same signature must be filtered further: according to formula (10), the prefix query is replaced by a range query within the same signature.

The specific query process is as follows: when an input image $A$ is provided, we first calculate the 1-norm $\|A\|_1$ and all extended signatures of the query vector $A$. The bloom filter is used to eliminate part of the extended signatures. Then, for each remaining signature, we calculate the query range $[\text{region\_sig}(A)\_\text{low limit}, \text{region\_sig}(A)\_\text{high limit}]$ according to formula (11). Namely, a sequential scan is carried out in Hbase, starting with $\text{region\_sig}(A)\_\text{low limit}$ and ending with $\text{region\_sig}(A)\_\text{high limit}$. Finally, the relevant information of the query results is returned for the next round of filtering.

$$ \left\{ \begin{array}{l} \text{region\_sig}(A)\_\text{low limit} = \|A\|_{1} - T \\ \text{region\_sig}(A)\_\text{high limit} = \|A\|_{1} + T \end{array} \right. $$
(11)
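Putting formula (11) together with the earlier sketches, the following illustrates the whole online query path. Here hbase_scan is a stand-in for the real Hbase range scan, and the row-key formatting and the row fields (feature, haar, image_id) are hypothetical.

```python
import numpy as np

def norm_bounds(feature: np.ndarray, T: float = 9.0):
    """Formula (11): the 1-norm window for the range scan under one signature."""
    norm1 = float(np.abs(np.asarray(feature)).sum())
    return norm1 - T, norm1 + T

def query_duplicates(gray, hbase_scan, bloom, region=0, H=2, T=9.0, t=8):
    """Online path: rapid filtration via Hbase, then precise filtration."""
    feat = block_average_feature(gray)            # formula (1)
    sig = signature(feat)                         # formula (3)
    q_haar = haar_feature(gray[:64, :64])         # stand-in for a proper 64 x 64 resize
    low, high = norm_bounds(feat, T)
    results = []
    for s in extended_signatures(sig, H=H):       # signature changes within H bits
        if not bloom.might_contain(s):            # skip signatures absent from Hbase
            continue
        prefix = row_prefix(region, s)
        # Keys are zero-padded so lexicographic order matches numeric order
        # (assumes non-negative norms).
        for row in hbase_scan(f"{prefix}_{low:08.1f}", f"{prefix}_{high:08.1f}"):
            if is_candidate_pair(feat, row.feature, T) and \
               haar_match(q_haar, row.haar, t):
                results.append(row.image_id)
    return results
```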

5 Experimental results and analysis

In order to verify the proposed method, this study conducts a large number of related experiments: (1) parameter estimation, in which we analyze experimental results to select the most suitable parameters and improve the identifiability of image features; (2) algorithm comparison, in which we compare the proposed method with other classic algorithms to verify its effect; and (3) large-scale image retrieval, which verifies the real-time performance and scalability of the method.

Because duplicate images in microblogs are difficult to count, the experiments are divided into two parts:

  1. 1.

    To ensure the accuracy of retrieval results in a single-machine environment, we randomly selected ten thousand images from the Corel image database as test images for parameter estimation and algorithm comparison. The Corel set includes a large number of similar images, which helps us select proper thresholds to distinguish similar images from duplicate images. An inverted index was used as the index structure.

  2. 2.

    In a distributed environment, in order to approach the real situation more closely, this paper downloaded 12 million images from Netease, Sohu, Sina, and other portals as test images. On these 12 million images, we constructed hadoop clusters to evaluate the real-time performance and scalability of large-scale image retrieval.

To generate copies, we selected 50 images from the above two test image sets as query images, and each query image was processed by scaling, watermark, and storage format transformations using Photoshop. The number of scaled images was 400; the number of watermarked images was 100; and the number of storage-format-transformed images was 100.

The hadoop clusters included four hosts: one master and three slaves. Each host was configured as follows: Intel(R) Xeon(R) CPU (E5645) @ 2.40 GHz with 32 GB RAM and 600 GB of hard disk space.

5.1 Parameter estimation

The values of H and T chosen in the experiments depend on the block-average grayscale features of the images stored on hadoop, and the value of t depends on their Haar wavelet features. Because the test images were randomly selected from the Internet, the three chosen values depend only on the selected features and are independent of the scenario. For other features, parameter values can be selected by the following method.

5.1.1 The choice of parameter H

In order to improve the anti-jamming capability of image signatures, this study obtained the optimal threshold H through a receiver operating characteristic (ROC) curve, which here represents the relation between recall and precision under different thresholds H. As shown in Fig. 6, when H = 2, the recall rate is largest. Analysis of the undetected duplicate images showed that the main factors affecting the recall rate were the thresholds T and t. Increasing H further had a very limited impact on the recall rate while decreasing the precision rate and increasing the retrieval time, so we chose H = 2.

Fig. 6

Threshold H

5.1.2 The choice of parameter T

To obtain the optimal threshold T, a precision–recall (PR) curve was used. Intuitively, an ideal duplicate image detection algorithm should be able to distinguish similar images and resist image transformations. In fact, however, due to the influence of the semantic gap, precision and recall constrain each other. As shown in Fig. 7, when the threshold T varies between 8 and 10, precision and recall achieve a good balance. We selected T = 9.

Fig. 7

Threshold T

5.1.3 The choice of parameter t

$$ \left\{ \begin{array}{l} \text{fr} = \dfrac{\text{number of undetected copies}}{\text{number of total copies}} \\[2mm] \text{cr} = \dfrac{\text{number of detected non-copies}}{\text{number of non-copies}} \end{array} \right. $$
(12)

When the threshold t is too large, it can hardly distinguish similar images, which reduces the precision rate. When t is too small, some duplicate images are judged as non-duplicates, which reduces the recall rate. Therefore, an appropriate threshold t must be chosen according to experimental results. This study obtained the optimal threshold t through the ROC curve, which represents the relation between false rejection (fr) and correct rejection (cr) under different thresholds t; fr and cr are defined in formula (12). As shown in Fig. 8, we selected t = 8.

Fig. 8

Threshold t

5.2 Comparison of algorithms

In order to verify the effectiveness of the proposed method, we compared it with Changick's method [5], Li's method [6], Feng's method [11], and Jegou's method [12]. These four methods are classic image retrieval algorithms with good robustness to local geometric distortion and scaling transformations.

As shown in Fig. 9, the proposed method has two advantages: (1) at the same precision rate, its recall rate is closest to the ideal value (100 %); (2) as the precision rate increases, the recall rates of the four reference methods decrease sharply, whereas the recall rate of the proposed method decreases slowly. Hence, the proposed method performs better.

Fig. 9

Algorithm comparison

5.3 Large-scale image retrieval

5.3.1 Scalability analysis

The average retrieval time under different numbers of images was tested. As shown in Fig. 10, with increasing number of images, the average retrieval time increased linearly. When the number of images was 12 million, the average retrieval time was 0.7 s. With the expansion of hadoop clusters, the average retrieval time can be further reduced. Therefore, this system has good scalability.

Fig. 10

Comparison of average retrieval time under different numbers of images

5.3.2 Algorithm optimization

The effects of the bloom filter (BF), prefix query (PQ), range query (RQ), and multiple threading (MT) were tested. As shown in Fig. 11, for 12 million images, the average retrieval time of the basic algorithm was 20.3 s. Adding the bloom filter shortened the average retrieval time to 17.2 s, an efficiency improvement of 15.3 %. Meanwhile, as the data volume increased, Hbase distributed the data to each node, and the nodes were divided into different regions; hence, multiple threading further improved the average retrieval time. However, the most obvious performance improvement came from the range query: it greatly reduced the amount of data loaded from Hbase and the number of comparisons in the next filter, so the average retrieval time was only 0.72 s.

Fig. 11

Algorithm optimization

5.3.3 Retrieval effect

As shown in Table 3, the experimental results indicate that, when H = 2, T = 9, and t = 8, the recall rate of the proposed method was 93.8 % and the precision rate was 99.1 %. With further adjustment of the thresholds T and t, the actual retrieval precision of this method can approach 100 % while the decrease in the recall rate remains limited. This demonstrates that the proposed method has high accuracy.

Table 3 Retrieval results

6 Conclusion

For tracking and verifying image sources, this paper described a real-time, large-scale duplicate image detection method based on multi-feature fusion. The proposed method uses multi-feature fusion to improve retrieval accuracy: a perception hash generates image signatures, reducing the search time, while block-average grayscale features and Haar wavelet features conduct multiple rounds of filtering, improving retrieval accuracy. Then, through its Hbase design, the method utilizes a bloom filter and a range query to improve retrieval efficiency: the bloom filter eliminates part of the extended signatures, reducing disk IO time, and the range query replaces the individual scanning of prefix queries, effectively reducing the amount of data loaded and the number of comparisons required. Experimental results show that, compared with existing algorithms, the proposed method exhibits higher precision and recall rates for scaling, watermark, and storage format transformations, while its real-time responsiveness and scalability also meet actual needs. Future work will mainly focus on more complex image transformations, especially cropping transformations.