1 Introduction

Content-based image retrieval (CBIR) is a long-studied topic. While there is a large body of research on low-level features, researchers are still working towards a better understanding of how to represent and obtain mid-level and high-level features. The semantic difference between low-level and high-level feature representations is commonly called the ‘semantic gap’, and bridging this gap remains a challenge.

To achieve a semantically meaningful description of an image’s content, one of the most important steps is to determine the region of interest (ROI). In this research, we study the feasibility of combining a saliency prediction model with object recognition to identify key objects in an image. The computational image saliency model is used to predict the focus points in the image, while the object recognition algorithm identifies its content. By combining the object and saliency information with the low-level description of the image, we obtain a high-level description of it. This combined description is then used to index the image.

This paper is organized as follows: a literature review is given in Sect. 2, our framework for semantic CBIR is presented in Sect. 3, the evaluation is presented in Sect. 4, and future work and conclusions are discussed in Sect. 5.

2 Literature Review

2.1 Object Recognition and Deep Learning in CBIR

Deep learning refers to a collection of machine learning techniques in which information is processed in multiple layers of hierarchical architectures. Since the successful use of deep convolutional neural networks in the image classification task of ILSVRC 2012 [1], deep learning has become the state of the art in computer vision, including tasks such as image classification and object recognition. Various models build on it, including MobileNet-SSD [2], Faster R-CNN [3] and R-FCN [4]. Deep learning requires a large database for training. With the growth of multimedia sources on the Internet, large numbers of user-annotated images are available for training. There are also datasets for object recognition competitions, such as MS-COCO [5] and KITTI [6]. These databases cover a wide range of themes, which allows a CBIR system to be trained and tested on diverse images.

Object recognition provides an effective path towards a higher-level description of an image. Image captioning is a popular application of the technique: by combining object recognition and natural language processing, it can provide a semantic description of an image, including the classes and characteristics of the objects it contains and the actions of the animals and people in it. Increasing attention has also been paid to instance-level image retrieval [7]; many such approaches use deep learning and user-generated data from the Internet.

2.2 Image Saliency in CBIR

Image saliency is the visual attention that a human observer pays to a certain position in an image. It can be used as a measure of the relative importance or meaningfulness of that position, such that an object containing that region should be given higher weighting in the indexing process. The image saliency map has been introduced as a metric for CBIR by many researchers [8]. Saliency prediction can be done with a bottom-up approach using low-level features, or with a top-down approach that incorporates external information such as heuristics.

3 New Model to Combine Saliency Prediction and Object Recognition

3.1 Framework

In this paper, we propose a framework (Fig. 1) that combines saliency prediction and object recognition with low-level features. After preprocessing, a vector representation of the image is generated and stored in the database. For an image query, the procedure is similar: after the query feature vector is generated, it is compared with the feature vectors stored in the database using a similarity measure. The results are sorted by similarity and the top results are retrieved.

While this pipeline is similar to most other CBIR algorithms, the main contribution of this study is the introduction of a new model for representing feature vectors and defining the similarity measure.
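As a rough illustration of this pipeline, the following sketch shows how indexing and querying could be organized. The helper names extract_feature_vector and distance are hypothetical placeholders for the components described in Sects. 3.2–3.5, not part of any library.

```python
# Minimal sketch of the indexing / retrieval loop described above.
# extract_feature_vector() and distance() are hypothetical placeholders
# for the components detailed in Sects. 3.2-3.5.

def index_database(image_paths, extract_feature_vector):
    """Precompute and store a feature vector for every database image."""
    return {path: extract_feature_vector(path) for path in image_paths}

def retrieve(query_path, index, extract_feature_vector, distance, top_k=5):
    """Return the top_k database images closest to the query."""
    query_vec = extract_feature_vector(query_path)
    ranked = sorted(index.items(), key=lambda item: distance(item[1], query_vec))
    return [path for path, _ in ranked[:top_k]]
```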

Fig. 1. The basic framework of the proposed content-based image retrieval model

3.2 Object Recognition and Bounding Box

Most current object recognition algorithms provide bounding boxes as output: each detection is represented by a rectangular area enclosing the object. An example is shown in Fig. 2(a). Each bounding box carries three types of information:

  1. Classification of the object: the object class is represented by an integer index corresponding to a category label such as ‘person’ or ‘toothbrush’, as shown in the figure.

  2. Position of the bounding box: the position is indicated by four values and also reflects the size of the object.

  3. Score of the detection: a floating point value indicating the confidence of the object recognition algorithm in the classification of the object. Classifications with a low score should be rejected.

Notice that a pixel can belong to multiple bounding boxes or to none of them. Besides, due to the rectangular nature of the bounding box, some irrelevant parts of the image or the background are also included in the box. In this research, object recognition is performed with the deep learning model “Single Shot Multibox Detector (SSD) with MobileNet” [9], pretrained on MS-COCO data [5]. It is used because it is lightweight and fast, which is crucial for speed performance, although some accuracy is sacrificed. The current model can classify 90 different categories of objects; the number of categories depends on the data and annotations used in training.
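A minimal sketch of turning the detector output into candidate objects is given below. It assumes the detector returns parallel lists of class indices, boxes and scores, which is a common but not universal convention; the exact output format depends on the detection framework used.

```python
# Sketch of converting raw detector output into candidate objects.
# Assumption: the detector returns parallel lists of integer class ids,
# boxes as (ymin, xmin, ymax, xmax) in normalized coordinates, and
# confidence scores in [0, 1]. Exact formats vary between frameworks.

def filter_detections(classes, boxes, scores, t=0.1):
    """Keep only detections whose confidence is at least t."""
    candidates = []
    for cls, box, score in zip(classes, boxes, scores):
        if score >= t:
            candidates.append({"class": int(cls),
                               "box": tuple(box),
                               "score": float(score)})
    return candidates
```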

3.3 Saliency Prediction

A saliency prediction model aims to predict the relative intensity of human visual attention and outputs it as a heatmap (Fig. 2), in which bright regions indicate highly salient parts of the image. The Itti-Koch approach is used in our research; it is a classical model that derives the saliency map from low-level features such as colour, intensity and orientation [10]. While there are alternative saliency models, especially deep-learning-based approaches, they are more time consuming and hence not implemented in this study.
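The following is a deliberately simplified, intensity-only sketch of the center-surround idea behind bottom-up models such as Itti-Koch; the full model additionally uses colour-opponency and orientation (Gabor) channels and a more elaborate normalization scheme. It assumes NumPy and OpenCV (cv2) are available.

```python
import cv2
import numpy as np

def simple_saliency(image_bgr):
    """Very rough intensity-only center-surround saliency map.

    Only a sketch in the spirit of bottom-up models such as Itti-Koch;
    the full model also uses colour and orientation feature channels.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    saliency = np.zeros_like(gray)
    # Center-surround differences approximated by differences of Gaussians
    # at several scales.
    for center_sigma, surround_sigma in [(1, 4), (2, 8), (4, 16)]:
        center = cv2.GaussianBlur(gray, (0, 0), center_sigma)
        surround = cv2.GaussianBlur(gray, (0, 0), surround_sigma)
        saliency += np.abs(center - surround)
    # Normalize to [0, 1] so it can be used as a heatmap.
    saliency -= saliency.min()
    if saliency.max() > 0:
        saliency /= saliency.max()
    return saliency
```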

Fig. 2. (a) Object recognition with bounding boxes, (b) visual attention in experiment and (c) visual attention prediction by the Itti-Koch model. Image from the CAT2000 database [11].

3.4 Feature Vector

In Fig. 3, the overall architecture for forming the feature vector is summarized:

  1. The target image is processed by the object recognition algorithm. Object bounding boxes (OBBs) and classifications of objects with high confidence (\(\ge t\)) are obtained.

  2. The saliency map of the image is calculated with the Itti-Koch algorithm.

  3. The average saliency intensity within each bounding box is calculated.

  4. The list of detected objects is sorted by average saliency.

  5. The top five objects by saliency intensity are retrieved.

To determine the value of t, a balance has to be struck between the number of candidate objects and the confidence level of the object detection result. Moreover, the optimal value of t varies with the content of the image. An iterative approach is used to determine t: starting from the initial value \(t=0.1\), the list of candidate objects is generated; if the number of OBBs in the image is less than the desired value (\(N=5\) in our model), the threshold is halved until enough objects are detected, as in the sketch below.
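The sketch below illustrates steps 1–5 together with the iterative threshold adjustment. It assumes a detection function of the form produced by filter_detections() above, normalized bounding box coordinates, and a saliency map with the same height and width as the image; the lower bound min_t used to stop the halving loop is an assumption added for illustration.

```python
# Sketch of steps 1-5 plus the iterative threshold adjustment.
# `detect(image, t)` stands for the object detector (cf. filter_detections
# above); `saliency` is a 2-D map with the same height/width as the image.

def mean_saliency_in_box(saliency, box, height, width):
    """Average saliency inside a normalized (ymin, xmin, ymax, xmax) box."""
    ymin, xmin, ymax, xmax = box
    y0, y1 = int(ymin * height), max(int(ymax * height), int(ymin * height) + 1)
    x0, x1 = int(xmin * width), max(int(xmax * width), int(xmin * width) + 1)
    return float(saliency[y0:y1, x0:x1].mean())

def top_salient_objects(image, saliency, detect, n=5, t=0.1, min_t=1e-3):
    """Halve t until at least n objects are detected (min_t is an assumed
    lower bound), then keep the n objects with the highest average saliency."""
    height, width = saliency.shape
    objects = detect(image, t)
    while len(objects) < n and t > min_t:
        t /= 2.0
        objects = detect(image, t)
    for obj in objects:
        obj["avg_saliency"] = mean_saliency_in_box(saliency, obj["box"], height, width)
    objects.sort(key=lambda o: o["avg_saliency"], reverse=True)
    return objects[:n]
```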

Fig. 3. Calculation of the image feature vector

In our model, we choose the five most salient objects to form the primary feature vector. After the five most salient objects are retrieved, their category indices are used to form the primary feature vector, which provides the semantic representation of the image. The order of the objects in the primary feature vector is important, as it reflects the relative importance of the objects. However, this ordering requirement can be too restrictive; this is discussed in Sect. 5.

The object recognition algorithm provides a semantic interpretation of each object. However, it misses some important low-level descriptions of the object, including colour and size. A low-level description is therefore extracted within each bounding box:

  1. Colour: the RGB values of the pixels in the bounding box are converted into the CIELAB colour space. The averages over all pixels in the three channels give the mean L, a and b values of the object.

  2. Size: the height and the width of the bounding box determine the relative size of the object in the image.

In total, there are five feature elements for each object (L, a, b, height and width). Combining the primary feature vector (5 elements) with the 5 object feature vectors yields a 30-element image feature vector, shown graphically in Fig. 4. These feature vectors are generated for both the query and the database images.
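A sketch of this feature extraction is given below. It assumes OpenCV for the RGB-to-CIELAB conversion (which maps L, a, b to 0–255 for 8-bit images; exact scaling depends on the library), normalized bounding box coordinates, and object dictionaries as produced by the earlier sketches.

```python
import cv2

def object_feature(image_bgr, box):
    """Five low-level features of one object: mean L, a, b of the crop
    and the normalized width/height of its bounding box."""
    h, w = image_bgr.shape[:2]
    ymin, xmin, ymax, xmax = box
    crop = image_bgr[int(ymin * h):int(ymax * h), int(xmin * w):int(xmax * w)]
    lab = cv2.cvtColor(crop, cv2.COLOR_BGR2LAB)
    mean_l, mean_a, mean_b = lab.reshape(-1, 3).mean(axis=0)
    return [float(mean_l), float(mean_a), float(mean_b), xmax - xmin, ymax - ymin]

def image_feature_vector(image_bgr, objects):
    """Concatenate the 5 category indices with the five 5-element object
    feature vectors into a 30-element image feature vector."""
    primary = [obj["class"] for obj in objects]                     # 5 elements
    per_object = [object_feature(image_bgr, obj["box"]) for obj in objects]
    return primary + [x for feats in per_object for x in feats]    # 5 + 25 = 30
```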

Fig. 4. Graphical representation of the image feature vector

3.5 Similarity Measure

To retrieve similar images, we have to compare the candidate image feature vector \(v_{c}\) and query feature vector \(v_{q}\). A distance metric \(D\left( v_{c},v_{q}\right) \) is defined and the comparison is done in our model in three steps:

  1. Comparison of the primary vectors: check whether the object categories agree between the candidate and the query. Notice that the order of the objects matters in this case; for example, the comparison between the primary vectors \(\left[ 1,2,3,4,5\right] \) and \(\left[ 2,3,4,5,1\right] \) results in element-wise disagreement for all elements. For each element-wise disagreement between the primary feature vectors, a penalty P is added to D. The sum of the penalties is denoted \(P\left( v_{c},v_{q}\right) \).

  2. Comparison of the object feature vectors: for each element-wise match between the primary feature vectors, the distance for that match is calculated as

    $$ D\left( o_{c}^{i},o_{q}^{i}\right) =\varDelta L^{2}+\varDelta a^{2}+\varDelta b^{2}+\alpha \left( \varDelta W^{2}+\varDelta H^{2}\right) $$

    where \(o_{c}^{i}\) and \(o_{q}^{i}\) are the \(i^{th}\) object feature vectors of the candidate and the query images respectively, \(\varDelta L\), \(\varDelta a\) and \(\varDelta b\) are the differences in the L, a and b channels, and \(\varDelta W\) and \(\varDelta H\) are the differences in width and height respectively.

  3. Adding the results of steps 1 and 2 gives the distance metric:

    $$ D\left( v_{c},v_{q}\right) =\sum _{i}D\left( o_{c}^{i},o_{q}^{i}\right) +P\left( v_{c},v_{q}\right) $$

As we can see from the definition, the most similar image to the query is the query image itself, since \(D\left( v_{q},v_{q}\right) =0\). The results are sorted by their distance metric values. The values of P and \(\alpha \) are empirically set to 10000 and 0.001, respectively.
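A minimal sketch of this three-step similarity measure is given below. It assumes the flat 30-element layout described above (five category indices followed by five blocks of L, a, b, W, H); this layout is an assumption about how the vector is stored, and the constants are the empirical values quoted in the text.

```python
# Sketch of the three-step similarity measure with the empirically
# chosen constants P = 10000 and alpha = 0.001 from the text.

P_PENALTY = 10000.0
ALPHA = 0.001

def distance(candidate_vec, query_vec, n=5):
    """Distance between two 30-element image feature vectors.

    Assumed layout (cf. Fig. 4): the first n elements are the primary
    vector of category indices; each following block of five holds
    (L, a, b, W, H) for the corresponding object.
    """
    d = 0.0
    for i in range(n):
        if candidate_vec[i] != query_vec[i]:
            d += P_PENALTY                      # element-wise category mismatch
        else:
            c = candidate_vec[n + 5 * i: n + 5 * (i + 1)]
            q = query_vec[n + 5 * i: n + 5 * (i + 1)]
            dl, da, db, dw, dh = (ci - qi for ci, qi in zip(c, q))
            d += dl ** 2 + da ** 2 + db ** 2 + ALPHA * (dw ** 2 + dh ** 2)
    return d
```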

3.6 Implementation

The model is implemented in Python 3.5 with TensorFlow. An example retrieval result is shown in Fig. 5.

Regarding the indexing time, we observe that it increases linearly with the number of images, as shown in Fig. 6, except for one outlier. The outlier may be due to the threshold adjustment process: although the images come from the same database, the computational complexity varies between categories.

Fig. 5. Example of a retrieval result on the COREL database. The first (leftmost) image is the query image, which is always the first retrieval result. The others are the 2nd–5th retrieval results. A green solid border indicates a successful retrieval, while wrong retrievals are bounded by a red dashed border.

Fig. 6. Number of images vs. indexing time

4 Evaluation

To study the performance of the system, a subset of the COREL image database [12] with 10 image categories, each containing 100 images, was used. The themes of the different categories are distinct to prevent ambiguity between categories. A retrieved image is considered successful only if it belongs to the same category as the query. The mean average precision [13] is shown in Table 1. For comparison, we also include the average precision over 100 retrievals for our model and the previously reported SIFT-LBP result [14]. Two of the results are also shown as ROC and precision-recall curves in Fig. 7. Overall, the results are worse than the previous result; in particular the recall of the system is not very good, and the precision is also quite low. This can be explained by the following reasons:

  1. Limitation of the object recognition algorithm: the object recognition algorithm plays an important role in the pipeline. If the class of an object is not correctly determined, a heavy penalty is imposed, regardless of the similarity of the low-level features. For example, the category ‘Horses’ gives a much better result than the other categories because its main objects (the horses) are usually correctly detected. On the other hand, the category ‘Mountains and glaciers’ gives very poor results because most objects fail to be recognized. The quality of retrieval therefore depends strongly on the object recognition algorithm, the training images and the annotations used in training.

  2. Naive matching of the primary feature vector: if an object exists in both the query and the candidate primary vectors but at different ranks, it is still considered a mismatch. This causes a high number of mismatches in the comparison. In fact, many of the images fail to match the query primary vector at all, resulting in the maximum penalty (50000). Such results cannot be meaningfully ranked and are rejected in the retrieval process, which is why the maximum number of retrievals is very low in certain cases.

  3. Parameters: there are three important parameters in this study: the number of bounding boxes, the penalty parameter P, and \(\alpha \). These parameters were set empirically and not optimized. Moreover, their optimal values depend on various factors, including the size and content of the images.

Table 1. Mean average precision on the COREL subset and average precision comparison with the SIFT-LBP approach
Fig. 7. (a) Retrieval example with a query image from the category ‘Horses’. (b) Retrieval example with a query image from the category ‘Mountains and glaciers’.

5 Future Work and Conclusions

The first improvement is to allow matching of objects in a different order: objects of the same category in the candidate primary feature vector and the query primary feature vector should be matched by reordering, and when multiple matches are possible, the match that minimizes \(D\left( v_{c},v_{q}\right) \) should be chosen (a sketch is given below). Although this increases the complexity of the problem, it can improve the results. Besides, semantic segmentation [15] is a developing field in computer vision and could be applied in our framework instead of the object recognition algorithm. Its main advantages are avoiding the overlap between the bounding boxes of objects and avoiding counting the same object multiple times (for example, multiple bounding boxes on the same person). Thus the accuracy of the system could be improved.
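One possible realization of this order-free matching is a brute-force search over all pairings, which is feasible for five objects (120 permutations). The sketch below is only an illustration of the idea under the same assumed 30-element vector layout and empirical constants as in Sect. 3.5; it is not part of the current implementation.

```python
from itertools import permutations

P_PENALTY = 10000.0   # same empirical constants as in the Sect. 3.5 sketch
ALPHA = 0.001

def reordered_distance(candidate_vec, query_vec, n=5):
    """Proposed order-free matching: try every pairing of candidate objects
    to query objects and keep the pairing that minimizes the total distance.
    Uses the same penalty and low-level terms as distance() in Sect. 3.5."""
    def pair_cost(ci, qi):
        if candidate_vec[ci] != query_vec[qi]:
            return P_PENALTY
        c = candidate_vec[n + 5 * ci: n + 5 * (ci + 1)]
        q = query_vec[n + 5 * qi: n + 5 * (qi + 1)]
        dl, da, db, dw, dh = (a - b for a, b in zip(c, q))
        return dl ** 2 + da ** 2 + db ** 2 + ALPHA * (dw ** 2 + dh ** 2)

    return min(sum(pair_cost(ci, qi) for qi, ci in enumerate(perm))
               for perm in permutations(range(n)))
```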

In this research, we explored the possibility of combining an object recognition algorithm and an image saliency prediction model with low-level features in CBIR. Although there are significant limitations in the current technology, this framework can shed light on the direction for developing CBIR algorithms that are more semantically meaningful.