
1 Introduction

Web image search and retrieval has become an increasingly important research topic due to the abundance of multimedia data on the Internet. Most web-scale image search engines mainly use two schemes for searching for images. In the first, keyword-based scheme, images are searched for with a query in the form of a textual keyword provided by the user. The second, example-based scheme allows users to search for similar images by providing an example image that serves as the query. Both schemes are illustrated in Fig. 1.

Fig. 1 Illustration of the keyword based and example based image search scheme

Independent of which search scheme is deployed, an image search engine generally operates in two main steps: an offline indexing step and an online query step. For many query keywords the image retrieval performance is good, but the precision of the returned results is still relatively low. Text-based results suffer from the ambiguity of query keywords, because it is difficult for users to accurately describe the visual content of target images using keywords alone. One of the major challenges is the conflict between the content of an image and the textual information of its webpage. This paper attempts to resolve this bottleneck by depending on both the textual and the visual information. Another major challenge in existing systems is that similarities of low-level visual features may not correlate with an image's high-level semantic meaning. To reduce this semantic gap, visual features are mapped to predefined attributes known as visual semantic descriptors.

In this paper we propose a web image search approach that requires both a query keyword and a query image. First, a text-based search is performed using the query keyword. From the pool of retrieved images, the user is asked to select a query image. Images in the pool are then re-ranked based on the visual semantic descriptor of the query image. This query-specific visual semantic descriptor effectively reduces the gap between low-level visual features and semantic categories, and makes image matching more consistent with visual perception.

Another major issue in web image search is the presence of near-duplicate images. We address this diversity problem to make the search result more diverse and to improve the efficiency of the search engine. Identifying duplicate images at upload time also prevents them from being stored and served repeatedly. In this paper we use a perceptual hash method, computed from image features, to remove duplicate images and improve search diversity. Experimental results show a clear improvement in the diversity of the search result, and encouraging results are obtained for the proposed image re-ranking method on a real-world image dataset collected from commercial search engines.

The rest of this paper is organized as follows. We discuss related work in Sect. 2, describe the details of our method in Sect. 3, present evaluation results in Sect. 4, and offer conclusions and future work in Sect. 5.

2 Related Work

Most internet-scale image search engines, such as Google Image Search and Bing, primarily depend on textual information. Annotation errors due to the subjectivity of human perception are one of the major disadvantages of this approach, and text-based image search suffers from the ambiguity of query keywords [1]. Content-based image retrieval (CBIR) was introduced in the early 1980s to overcome the limitations of text-based image search; in CBIR, visual features of images are used to evaluate image similarity.

Generally there are three categories of visual re-ranking methods: classification based, clustering based and graph based. Classification-based methods first select some pseudo-relevant samples from the initial search result. Deng [2] learned to produce a similarity score for retrieval by using a predefined comparison function based on a known hierarchical structure. In clustering-based methods, the pool of images in the initial search result is first grouped into different clusters; the clusters are ordered according to the cluster conditional probability, the samples within each cluster are ordered by their cluster membership values, and the re-ranked result list is created from this ordering. Graph-based methods have been proposed more recently and have received increased attention. In these methods a graph is first built with the images in the initial search result serving as nodes. If two images are visual neighbors of each other, an edge is defined between them, and the edges are weighted by the visual similarities between the images. Re-ranking can then be formulated, for instance, as a random walk over the graph or as an energy minimization problem.

Cui [1, 3] classified query images into eight predefined adaptive weight categories; inside each category a specific pre-trained weight schema is used to combine visual features. To reduce the semantic gap, query-specific semantic signatures were first proposed in [4]. The proposed visual semantic descriptor is effective in reducing the semantic gap when computing the similarities of images. One shortcoming of the existing systems [1, 3, 5] is that duplicate images sometimes show up as similar images to the query. The proposed system overcomes this problem by adding a perceptual hash method to detect duplicate images. Our work incorporates both textual and visual information for re-ranking; moreover, the incorporation of textual information and duplicate detection, which significantly improves the search result, was not considered in previous work.

3 Method

We use both a query keyword and a query image for searching images. The images for re-ranking are collected from different search engines such as Google and Bing. The user first submits a query keyword; from the pool of images returned by the text-based search, the user is asked to select a query image which reflects the user's search intention. Based on their visual similarity with the query image, the images in the pool are re-ranked. The semantic classes associated with a query keyword are discovered by expanding the query keyword from the result of the text-based image search, utilizing both text and visual features. These expanded keywords define the semantic classes for the query keyword. For each semantic class the training images are collected. A multilayer perceptron on the text and visual features is trained from the training sets of the semantic classes, and its output is stored as a visual semantic descriptor which describes the visual content from different aspects. We combine the described features together to train a single classifier, which extracts a single semantic descriptor for an image, and the extracted descriptors are stored. These semantic descriptors are used to compute the similarities for re-ranking. This provides much better re-ranking accuracy, since training the classifiers of reference classes captures the mapping between visual features and semantic meanings [5]. The flowchart of our approach is shown in Fig. 2.
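The overall flow can be summarized in the following sketch. It is only an illustration of the pipeline described above and detailed in Sects. 3.1–3.4; the helpers `text_search`, `select_query_image`, `descriptor_model` and `phash` are hypothetical placeholders, not an actual API.

```python
import numpy as np

def rerank(query_keyword, search_engine, select_query_image, descriptor_model, phash):
    """Sketch of the proposed pipeline (all helpers are hypothetical placeholders)."""
    # 1. Text-based search returns an initial pool of images for the keyword.
    pool = search_engine.text_search(query_keyword)

    # 2. The user picks one image from the pool as the query image.
    query_image = select_query_image(pool)

    # 3. Remove duplicates with the perceptual hash (Sect. 3.3).
    unique_pool, seen = [], []
    for img in pool:
        h = phash(img)
        if all(bin(h ^ s).count("1") > 0 for s in seen):  # keep images with a distinct hash
            seen.append(h)
            unique_pool.append(img)

    # 4. Compute visual semantic descriptors (Sect. 3.4) and re-rank by L1 distance to the query.
    v_query = descriptor_model(query_image)
    distances = [np.abs(descriptor_model(img) - v_query).sum() for img in unique_pool]
    return [unique_pool[i] for i in np.argsort(distances)]
```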

Fig. 2 Overall block diagram

3.1 Feature Design

We adopt a set of features that describe the visual content of an image in different aspects. To reduce the semantic gap, both high-level and low-level features are used to capture the user's search intention. Here we briefly explain the features used in this work.

  • Color: Color is the most widely used visual feature in image retrieval. A color description that offers a good trade-off between efficiency and effectiveness in illustrating the color distribution of an image is the set of color moments [6]: mean, standard deviation, and skewness. Mathematically these moments are defined as in the equations given after this list [7].

  • Texture: Texture is one of the major features used in recognizing objects [8]. Here we use the co-occurrence matrix for image texture analysis.

  • Attention Guided Color Signature: The color signature describes the color composition of an image [9]. Cluster centers and their relative proportions are taken as the signature.

  • Histogram of Gradient (HoG): HoG reveals distributions of edges over different parts of an image, and is especially effective for images with strong long edges [10].

  • Wavelet: We use the 2nd order moments of wavelet coefficients in various frequency bands to characterize the texture properties in the image [11].
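For reference, assuming \(p_{ij}\) denotes the value of the j-th pixel in the i-th color channel and N the number of pixels of the image, the three color moments mentioned in the first item above are commonly defined as [7]

$$\mu_{i} = \frac{1}{N}\sum_{j=1}^{N} p_{ij}, \quad \sigma_{i} = \left(\frac{1}{N}\sum_{j=1}^{N} \left(p_{ij} - \mu_{i}\right)^{2}\right)^{1/2}, \quad s_{i} = \left(\frac{1}{N}\sum_{j=1}^{N} \left(p_{ij} - \mu_{i}\right)^{3}\right)^{1/3}$$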

3.2 Expanding the Keywords and Collection of Training Images

The keyword expansions most relevant to the query keyword are identified to define the semantic classes associated with each query keyword (for example, for the query keyword 'diamond' some of the semantic classes identified are black diamond, diamond jewelers, red diamond, etc.). To find the keyword expansions, a text-based search is first performed and the top-ranked images are retrieved. Keyword expansions are found from the words extracted from these top retrieved images. The T most frequent words \(W_{I} = \left\{ {w_{I}^{1} ,w_{I}^{2} , \ldots ,w_{I}^{T} } \right\}\) among the top D re-ranked images are identified, and these words are used for expanding the query keyword. The extracted words are sorted according to their frequency of occurrence among the D images, from highest to lowest. If a word w is among the top words of image I, it has a ranking score \(r_{I}(w)\) according to its ranking order; otherwise its ranking score is zero [5].

$$r_{I} (w) = \begin{cases} T - j, & w = w_{I}^{j} \\ 0, & w \notin W_{I} \end{cases}$$
(1)

To obtain the training images of the semantic classes, we combine each word obtained from the keyword expansion with the original query keyword and use it as a query to the search engine. From the search result, the top K images are taken as the training images of the semantic class. To reduce the computational cost, semantic classes that are both semantically and visually similar are removed.
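A minimal sketch of the ranking score in Eq. (1) and of its aggregation over the top D images is given below; the per-image word lists and the simple frequency-based selection of top words are assumptions for illustration, not the exact procedure of [5].

```python
from collections import Counter

def ranking_score(word, top_words, T):
    """r_I(w) from Eq. (1): T - j if word is the j-th (1-indexed) top word of image I, else 0."""
    return T - (top_words.index(word) + 1) if word in top_words else 0

def expand_keyword(images_words, T=10, n_expansions=5):
    """images_words: one list of surrounding-text words per top-D image.
    Returns the candidate expansions with the highest accumulated ranking scores."""
    scores = Counter()
    for words in images_words:
        # The T most frequent words of this image, ordered from most to least frequent.
        top_words = [w for w, _ in Counter(words).most_common(T)]
        for w in top_words:
            scores[w] += ranking_score(w, top_words, T)
    return [w for w, _ in scores.most_common(n_expansions)]

# Each returned expansion is combined with the original query keyword and submitted to the
# search engine; the top K results form the training images of that semantic class.
```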

3.3 Perceptual Hash Method

To identify duplicate images we use a perceptual hash method. Duplicate images waste storage space and decrease search diversity. Perceptual image hash functions create hash values based on the image's visual appearance: the function computes similar hash values for similar images, whereas for different images dissimilar hash values are calculated. Finally, a similarity function comparing two hash values decides whether the two images are different or not. The steps involved in the method are as follows:

  • Reduce size: The method starts by resizing the image to a small size; 32 × 32 is a good choice. This step is done to simplify the DCT (Discrete Cosine Transform) computation.

  • Reduce color: The image is converted to grayscale to further reduce the number of computations.

  • Compute the DCT: The DCT separates the image into a collection of frequencies and scalars; here a 32 × 32 DCT is used.

  • Reduce the DCT: While the full DCT is 32 × 32, only the top-left 8 × 8 block is kept, since it represents the lowest frequencies in the image.

  • Calculate the average value: The mean of the retained DCT values is computed.

  • Further reduce the DCT: Each of the 64 hash bits is set to 1 or 0 depending on whether the corresponding DCT value is above or below the average value.

  • Construct the hash: The 64 bits are packed into a 64-bit integer. (To visualize what this fingerprint looks like, the bit values can be mapped to +255 and −255, placed in the 32 × 32 DCT with zeros for the high frequencies, and converted back into a 32 × 32 image.) To compare two images, the hash is constructed for each image and the number of bit positions in which they differ (the Hamming distance) is counted. A distance of zero indicates very similar pictures; such duplicate images are identified and eliminated, thereby improving the search results. A minimal implementation of these steps is sketched after this list.
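The following sketch implements the steps above, assuming Pillow, NumPy and SciPy are available; the zero-distance duplicate criterion follows the description, though in practice a small threshold may be used.

```python
import numpy as np
from PIL import Image
from scipy.fftpack import dct

def phash(path):
    """64-bit perceptual hash: resize to 32x32, grayscale, 2-D DCT, keep top-left 8x8, threshold by mean."""
    img = Image.open(path).convert("L").resize((32, 32), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.float64)
    freq = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")
    low = freq[:8, :8]                       # lowest frequencies of the image
    bits = (low > low.mean()).flatten()      # 64 hash bits
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(h1, h2):
    """Number of differing bit positions between two 64-bit hashes."""
    return bin(h1 ^ h2).count("1")

# Two images are treated as duplicates when the distance is zero (or below a small threshold):
# is_duplicate = hamming(phash("a.jpg"), phash("b.jpg")) == 0
```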

3.4 Combining Features and Visual Semantic Descriptor Extraction

The above described features characterize images from the different perspectives of color, shape and texture. The objective of a visual semantic descriptor is to capture the visual content of an image. Here we combine all the visual features to train a single multilayer perceptron that better distinguishes the reference classes, and the output of the classifier is taken as the visual semantic descriptor. If there are S semantic classes for a query keyword, a classifier on the visual features is trained which outputs an S-dimensional vector v indicating the probability of the image belonging to the different semantic classes; this vector is stored as the visual semantic descriptor of the image. The distance between two images \(I^{a}\) and \(I^{b}\) is measured as the L1 distance between their semantic descriptors \(v^{a}\) and \(v^{b}\):

$$d(I^{a} ,I^{b} ) = \left\| {v^{a} - v^{b} } \right\|_{1}$$
(2)
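As an illustration only, the following sketch uses scikit-learn's MLPClassifier as the multilayer perceptron; the feature dimensionality, network size and random training data are placeholders standing in for the combined visual features and semantic-class labels of the training images collected in Sect. 3.2.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder training data: in the actual system these are the combined visual features
# and semantic-class labels (0 .. S-1) of the training images from Sect. 3.2.
rng = np.random.default_rng(0)
X_train = rng.random((500, 64))       # 500 training images, 64-D combined features (illustrative sizes)
y_train = rng.integers(0, 10, 500)    # S = 10 semantic classes

mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300, random_state=0)
mlp.fit(X_train, y_train)

def semantic_descriptor(features):
    """S-dimensional class-probability vector used as the visual semantic descriptor."""
    return mlp.predict_proba(features.reshape(1, -1))[0]

def l1_distance(v_a, v_b):
    """Eq. (2): L1 distance between two visual semantic descriptors."""
    return np.abs(v_a - v_b).sum()

# Re-ranking places the images with the smallest L1 distance to the query descriptor first.
```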

4 Experimental Results

Images collected from different search engines are used to test the re-ranking performance. Averaged top-m precision is used as the evaluation criterion: top-m precision is defined as the proportion of relevant images among the top m re-ranked images, and averaged top-m precision is obtained by averaging over all query images. The averaged top-m precisions on the dataset are shown in Fig. 3, and the improvements of the averaged top-10 precision on the 10 query keywords, comparing the text-based and visual-based methods, are shown in Fig. 4. Here we choose 10 semantic classes for a single query keyword and only one classifier is trained combining all types of features, so the semantic descriptors have 10 dimensions on average. For each semantic class fifty images are selected, so 500 images are used for a single query keyword. For training, 75 % of the images were used and the remaining 25 % were utilized for testing.
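For clarity, the evaluation metric can be computed as in the following sketch, where `relevance` is assumed to hold one 0/1 relevance list per query image, ordered by re-ranked position.

```python
import numpy as np

def averaged_top_m_precision(relevance, m):
    """relevance: list of per-query 0/1 lists ordered by re-ranked position.
    Top-m precision of a query is the fraction of relevant images among its first m results;
    the metric averages this value over all query images."""
    return float(np.mean([np.mean(r[:m]) for r in relevance]))

# Example with two query images and m = 5: result is 0.6
# averaged_top_m_precision([[1, 1, 0, 1, 0], [1, 0, 0, 1, 1]], m=5)
```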

Fig. 3 Averaged top m precisions on different methods

Fig. 4 Improvements of averaged top 10 precisions on the 10 query keywords

In this work we combine all the extracted features to train a single classifier, which reduces the computational overhead of training separate classifiers for each feature. The dimensionality of the extracted features is very large, so existing systems incur a high computational cost when comparing the visual features of images during re-ranking. Here we avoid this overhead by using visual semantic descriptors, which have a much lower dimensionality than the raw visual features. The use of the perceptual hashing method increases the diversity of the search result by removing duplicate images. The proposed work significantly outperforms the text-based and visual-based methods which directly compare visual features: the averaged top-10 precision is enhanced from 53 % (visual based) to 74 %, an improvement of 21 percentage points.

5 Conclusion and Future Work

We propose a web image search approach which requires both a query keyword and a query image, whereas previous methods use only textual information or only the example input by the user. Visual semantic descriptors are proposed to combine the visual features and to compute the visual similarity with the query image, so that user intention is captured by both textual and visual information. The proposed image re-ranking framework incorporates duplicate detection to identify images that duplicate others in the result, which makes the search result more diverse. In the future we plan to extend this work to video event recognition [12]. To further improve the quality of the re-ranked images, we aim to combine this work with the photo quality assessment work in [13].