1 Introduction

Content Based Image Retrieval (CBIR) has been an active field for decades, and remains active as the amount of available data increases exponentially. This phenomenon has highlighted the need for efficient and effective tools to manage such volumes of data. One important topic in the CBIR field is the construction of the image signature. Indeed, the image signature is at the core of any CBIR system. An accurate and discriminative signature will improve the precision of the retrieval process. It will also help to bridge the well-known semantic gap between low-level features and the semantic concepts a user perceives in the image. Figure 1 illustrates the overall flowchart of CBIR systems. The research work proposed in this paper focuses on the image signature construction step.

Fig. 1 General CBIR system

In this paper, we focus on “expert image datasets”, which are of interest for domain expert end users. These expert end-users can be clinicians, historians, digital curators, numismatists, etc. Such expert datasets may have quite heterogeneous contents (e.g. historic images of persons, constructions, etc.) or more specific contents (e.g. datasets of old coins, butterflies, etc.). The particularities of such datasets have to be considered in the indexing tools that will help experts manage their data for further image exploitation stages such as retrieval or browsing. Moreover, improving the image representation will benefit other application domains, such as action or activity recognition in computer vision [22, 23].

Our study, conducted within the framework of a cultural heritage project, aims at indexing expert image datasets of limited size (around 10000 images). These datasets are very specific and dedicated to the cultural heritage domain, with very little or no prior knowledge on each image.

Among the numerous state-of-the-art approaches that have tried to “narrow down” the semantic gap, some have improved the descriptive power of visual features [35], gradually refining the existing local and global image descriptors, while others have proposed effective ways to use, mix and optimize these features. Among the latter, the bag of visual words model (BoVW) [8, 32] has become a reference in CBIR. The BoVW model represents images as histograms of visual words, enhancing retrieval efficiency without losing much accuracy.

More recently, some researchers have argued that the discriminative power of BoVW was not sufficient and have proposed to construct visual phrases or bags of bags of words by structuring visual words together in different ways. However, bag of visual phrases models [27, 30, 38] are computationally expensive.

Many new learning algorithms and architectures are currently being developed using deep neural networks in order to further enhance their performance and/or effectiveness. Since the AlexNet architecture that achieved the ImageNet 2012 milestone [19], developments for object classification include very deep networks such as VGG16/VGG19 [31] and MSRANET [13], wider deep networks, e.g. obtained by adding Inception modules to the architecture [15, 34], and wider and deeper, yet efficient, networks such as ResNet [12], Inception V4 + ResNet [33] and Identity ResNet [14]. When semantic information is available, a wide spectrum of cross-modal deep learning approaches exists in the literature [37].

Because of their very large number of parameters to adjust, learning-based methodologies, including deep learning models, generally require massive amounts of annotated data to be trained, which might be very costly and even almost impossible to obtain in some applications. As previously mentioned, our objective is to index small, dedicated expert datasets of cultural heritage with no prior knowledge on the data. Thus, we concentrate our study on improving the image representation without using learning algorithms, which are inappropriate in our case.

Thus, this paper presents our n-Bag of visual words framework, denoted n-BoVW, which increases the discriminative power of the BoVW model. Our framework is unsupervised and needs no semantic annotations to create the image representation. It is an extension of the framework we initially proposed in [26], with a new image representation proposal and a complete study of the framework through extensive experiments. n-BoVW uses the idea of visual phrases by selecting multiple visual words to represent each key-point, but keeps the efficiency of the BoVW model thanks to a compression algorithm. Three methodologies are proposed and combined with the BoVW model to obtain our final image representation. Our experimental results on different “generic” benchmark datasets highlight the potential of our proposal.

The remainder of this article is structured as follows: we provide a brief overview of related work on bags of visual words and phrases in Section 2. Then, we explain our different proposals in Section 3. We present the experiments on 4 different datasets and discuss the findings of our study in Section 4. Section 5 concludes and gives some perspectives to our work.

2 State of the art

We present in this section a brief overview of unsupervised approaches in the CBIR literature that are related to the BoVW model proposed by Csurka et al. [8]. Its inspiration comes from the Bag of Words model [11] of the Information Retrieval domain. The BoVW retrieval framework contains four main steps. For all images, feature detection and extraction are performed. These two steps detect a list of key-points with rich visual and local information and convert this information into vectors. Many visual descriptors have been created; among them, the Scale Invariant Feature Transform (SIFT) [24] and Speeded-up Robust Features (SURF) [5] became two of the most popular. A recent comparative study details the more recent visual descriptors [7]. Then, an off-line process extracts the visual vocabulary, a set of visual words, by applying a clustering algorithm to the set of visual features. Finally, each key-point of each image is assigned to the closest visual word of the vocabulary. Thus, each image is represented by a histogram of visual word frequencies, i.e. the image signature.
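To make this pipeline concrete, the minimal sketch below illustrates the two core steps (offline vocabulary construction and histogram assignment) in Python with NumPy and scikit-learn. The function names, the use of k-means and the L1 normalization are our own illustrative assumptions; descriptor extraction (e.g. SURF) is assumed to be done elsewhere.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=500, seed=0):
    """Offline step: cluster descriptors from a separate dataset into visual words."""
    kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=seed)
    kmeans.fit(all_descriptors)                 # all_descriptors: (N, dim) array
    return kmeans.cluster_centers_              # vocabulary: (n_words, dim) array

def bovw_signature(descriptors, vocabulary):
    """Online step: assign each key-point to its closest word and count frequencies."""
    # Euclidean distances between each descriptor and each visual word
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)                # index of the closest word per key-point
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)          # L1-normalized word-frequency histogram
```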

Inspired by the BoVW model, the Fisher Kernel [28] and the Vector of Locally Aggregated Descriptors (VLAD) [17] have met with great success. The first approach, proposed by Perronnin and Dance [28], applies Fisher Kernels to visual vocabularies represented by means of a Gaussian Mixture Model (GMM). VLAD was introduced by Jégou et al. [17] and can be seen as a simplification of the Fisher kernel. The idea of VLAD is to assign each key-point to its closest visual word and to accumulate, for each visual word, the differences between the word and the descriptors assigned to it.
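As a minimal illustration of the VLAD idea (a sketch, not the original implementation), the following code accumulates, for each visual word, the residuals of the descriptors assigned to it; the final L2 normalization is a common but assumed choice.

```python
import numpy as np

def vlad_signature(descriptors, vocabulary):
    """Minimal VLAD sketch: sum of residuals (descriptor - closest word) per word."""
    v, dim = vocabulary.shape
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)              # hard assignment to the closest word
    agg = np.zeros((v, dim))
    for w in range(v):
        assigned = descriptors[nearest == w]
        if len(assigned) > 0:
            agg[w] = (assigned - vocabulary[w]).sum(axis=0)
    flat = agg.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)  # L2-normalized VLAD vector
```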

Recently, some researchers have focused on improving the discriminative power of the BoVW model. To this end, they have proposed to construct visual phrases or groups/bags of bags of words. Among them, we can cite the work of Yang et al. [38] or Alqasrawi et al. [2], who have used the spatial pyramid representation [20] to construct visual phrases from words that are spatially close or co-occur in the same sub-region. They obtained good results for classification purposes. Ren et al. [30] have extended the BoVW model into Bags of Bags of Visual Words. They have proposed an Irregular Pyramid Matching with the Normalized Cut methodology to subdivide the image into a connected graph. Jiang et al. [18] have tried to overcome the problem of visual phrase burstiness, i.e. the repetition of visual phrases influencing the image similarity measure. They have proposed a framework combining geometry constraints and probabilistic filtering. Other researchers, such as Yeganli et al. [39], have chosen to mix several vocabularies with different image resolutions.

Most of those methodologies, by considering more meaningful combinations of words, reach a better effectiveness than the original BoVW model. However, this improved performance comes at the cost of efficiency, as the processes for extracting/matching word combinations are generally quite costly. Our proposal aims to avoid this efficiency loss while keeping high performance. Moreover, compared to the state-of-the-art bag of visual phrases approaches, we propose to improve the description of each key-point with its neighborhood, whereas BoVP approaches link key-points together. We also remind the reader that our proposal is unsupervised, which makes it useful for expert datasets in different applicative contexts.

3 Approach

In this section, we present our global framework before detailing our contributions. Motivated by our specific context of small expert datasets without prior knowledge on the images, our main objective is to improve the image representation. Our unsupervised approach increases the discriminative power of the image signature without losing much efficiency in the retrieval process. Our goal is to propose an alternative to supervised approaches, including deep learning approaches, with close performance.

Thus, we use a CBIR framework without any image filtering, refinement process or data augmentation methodology applied to the visual features used or to the constructed vocabularies. This ensures the reproducibility of our results.

Figure 2 presents the different steps of our global framework. In the top part of Fig. 2, we find the detection and extraction steps for each image of the dataset. Then, using the previously constructed visual vocabulary, we use four different ways to construct the image signature. The first “column” is the classical BoVW model, which gives for each image a histogram of visual word frequencies as signature (which will be binarized to be combined). As in most CBIR systems using the BoVW model, a standard off-line learning process constructs the visual vocabulary on a separate dataset.

Fig. 2 Global framework

The other “columns” correspond to our first contribution, an approach we denote n-Bag of Visual Words (n-BoVW). n-BoVW selects n visual words from the vocabulary to represent each detected key-point by a visual phrase. Our motivation in assigning n words to each selected key-point is to increase the discriminative power of the BoVW model. Three different methodologies have been studied: (i) selecting the n closest visual words to a key-point (second “column” in the image), (ii) clustering the n nearest key-points together in the visual feature space to obtain a list of n visual words, one word per key-point inside the small cluster, and (iii) taking a list of n visual words from the cluster of the visual word nearest to the key-point. Compared to state-of-the-art bag of visual phrases approaches that group key-points of a similar region or object (e.g. [18, 30]), our proposals give multiple words to each key-point, improving its discriminative power.

Indexing images with a list of visual words per key-point, i.e. a visual phrase, is more complex than assigning a single word per key-point, particularly for the retrieval process, which needs to compute the similarity between sets of visual phrases instead of between word frequency histograms. Thus, for the three proposals, our second contribution is a compression process that ensures an efficient retrieval. During this process, the list of visual phrases is reduced to a simple histogram of frequencies similar to the BoVW histogram. A final combining step is also proposed to construct the final signature of the image. This step mixes the obtained histograms to improve the discriminative power of the image signature. The following subsections detail these proposals. Note that the retrieval step is common to most CBIR systems: we compute the nearest neighbors of a query image using a similarity metric between the image signatures.

3.1 n-bag of visual words methodologies

As mentioned in Section 2, visual phrases group visual words together to be more discriminative. Our first contribution presents three different methodologies to better describe or represent each key-point by n visual words. Note that visual phrase models from the literature usually take n words from different key-points with the objective of better representing the surrounding sub-region. Our approach differs, as we aim at providing a more precise description of each key-point using a small vocabulary size.

The first methodology we propose, referred to as n-BoVW1, selects n visual words from the same visual vocabulary to represent each key-point of an image. Let W denote the vocabulary of v visual words vw1,...,vwv constructed using an offline process on a separate dataset. Let KPi be the set of key-points extracted by the detection step for image i, with kpip the p-th key-point of image i. For each kpip, we compute the Euclidean distance (dL2) between the key-point and each visual word vwj from the vocabulary W:

$$ dL2(kp_{ip}, vw_{j}) = \sqrt{\sum\limits_{d = 1}^{dim}(f_{ip_{d}} - vw_{j_{d}})^{2}}, $$
(1)

where \(f_{ip_{d}}\) is the d-th value of the extracted visual feature f of dimension dim for kpip.

Then, W is sorted according to these distances in order to pick the n nearest visual words to kpip. Thus, for each key-point kpip, we obtain a visual phrase vp1ip, i.e. a set of n distinct visual words \({vw1_{1_{ip}}, ..., vw1_{n_{ip}}}\). An example is given in Fig. 3 with n = 2 for better understanding. We can see that the two nearest visual words of kp2 are vw1 and vw6, thus kp2 is represented by the visual phrase (vw1, vw6); similarly, kp8 and kp5 are represented by (vw5, vw3) and (vw1, vw7) respectively.

Fig. 3 Example for n-BoVW1
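A minimal sketch of this first methodology is given below, under the same array conventions as before (descriptors of shape (|KPi|, dim), vocabulary of shape (v, dim)); the function name is ours.

```python
import numpy as np

def nbovw1_phrases(descriptors, vocabulary, n=2):
    """n-BoVW1 sketch: each key-point keeps its n closest visual words (Eq. 1)."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    # For every key-point, the indices of the n nearest words form its visual phrase
    return [tuple(np.argsort(row)[:n]) for row in dists]
```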

With this first methodology, we ensure the description of a key-point by a visual phrase of n distinct words. However, it never takes into account the possibility that the key-point could be better represented by a single word.

Our second proposal, referred to as n-BoVW2, addresses this limitation, as well as the fact that a key-point description can be somewhat noisy, which makes it interesting to look at its surroundings directly in the descriptor (or visual feature) space. Much as a clustering algorithm would, we gather the nearest key-points in the visual feature space to form a strong choice of visual words. Each of those selected key-points is then linked to only one visual word. Note that the probability that nearby visual features are represented by similar visual words is high. So, it allows the possibility of having only one representative visual word for the small cluster of key-points.

For each key-point kpip, we compute the Euclidean distances to KPi, i.e. the other key-points of image i. These distances are then sorted in order to retrieve the n nearest key-points in the visual feature space (including the current key-point itself). This set of nearest key-points NKPip is then used to select the representative visual words. For each key-point of NKPip, its nearest visual word is computed using the L2-distance. In the end, for each key-point kpip we also obtain a visual phrase vp2ip, i.e. a set of n visual words \({vw2_{1_{ip}}, ..., vw2_{n_{ip}}}\), with a high probability of duplicates. An example is given in Fig. 4. We can see that the nearest key-point of kp2 is kp5; both key-points are represented by vw1, so we have only vw1 to represent kp2, which differs from the first method. However, kp8, linked to its nearest neighbor kp7, is still represented by the same visual phrase (vw2, vw3), and kp5 (linked to kp6) is represented by (vw1, vw7).

Fig. 4 Example for n-BoVW2
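The sketch below follows the same conventions; it is our illustrative reading of n-BoVW2, where each key-point collects the closest word of each of its n nearest key-points (itself included), so duplicates may occur.

```python
import numpy as np

def nbovw2_phrases(descriptors, vocabulary, n=2):
    """n-BoVW2 sketch: words of the n nearest key-points in descriptor space."""
    d_kp = np.linalg.norm(descriptors[:, None, :] - descriptors[None, :, :], axis=2)
    d_vw = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    closest_word = d_vw.argmin(axis=1)          # one visual word per key-point
    phrases = []
    for p in range(len(descriptors)):
        neighbors = np.argsort(d_kp[p])[:n]     # n nearest key-points (self has distance 0)
        phrases.append(tuple(closest_word[q] for q in neighbors))  # duplicates allowed
    return phrases
```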

Our last proposal, referred to as n-BoVW3, links a key-point to a set of visual words that belong to the same cluster. For this part, we cluster the vocabulary into small groups of n visual words. Such a set of visual words represents a larger region of the vocabulary space than a single word. The first step is to create the visual phrase vocabulary VP using a clustering algorithm. To do so, we tried both a hierarchical cluster analysis (HCA with complete linkage) and a space-partitioning data structure, the Kd-tree [1]. Note that our framework mostly needs small clusters, so we select half of the vocabulary size as the number of clusters, which is a good starting point. Tuning this parameter is out of the scope of this paper. Of course, some clusters may contain only one word, while others contain two, three or more words. The next step is to assign a visual phrase from VP to each key-point by selecting the nearest visual word and its associated neighbors.

Figure 5 gives an example of this proposal. Both key-points kp2 and kp5 are linked to the visual word vw1. vw1 and vw7 belong to the same cluster C1. Thus, the visual phrase (vw1, vw7) represents those two key-points. If we look at kp8, the nearest word is vw3 (as in the BoVW model), which is in the same cluster C4 as vw5 and vw2. Thus, all possible visual phrases ((vw3, vw5), (vw5, vw2), (vw3, vw2)) using those three visual words will be added to the image representation. This pedagogical example shows that this proposal also differs from BoVW and from the first two approaches (n-BoVW1 and n-BoVW2). This difference is amplified with real data.

Fig. 5 Example for n-BoVW3
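The following sketch illustrates n-BoVW3 with the HCA (complete linkage) variant and v/2 clusters, as described above; treating a singleton cluster as a one-word phrase and expanding larger clusters into all 2-word pairs are our assumptions for the n = 2 case.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import AgglomerativeClustering

def nbovw3_phrases(descriptors, vocabulary):
    """n-BoVW3 sketch: phrases taken from the cluster of the nearest visual word."""
    v = len(vocabulary)
    labels = AgglomerativeClustering(n_clusters=v // 2,
                                     linkage="complete").fit_predict(vocabulary)
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)              # closest word per key-point, as in BoVW
    phrases = []
    for w in nearest:
        members = np.flatnonzero(labels == labels[w])
        if len(members) == 1:                   # singleton cluster: a one-word "phrase"
            phrases.append((int(members[0]),))
        else:                                   # all 2-word phrases from the cluster
            phrases.extend(combinations(members.tolist(), 2))
    return phrases
```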

3.2 Compression algorithm

The main disadvantage of obtaining visual phrases from the proposed methodologies is the number of possible phrases, i.e. \(\binom{v}{n} = \frac{v!}{(v-n)!\,n!}\), which is computationally too high for a retrieval system. To deal with this phenomenon, we propose a compression algorithm that applies to all of our proposals.

We first noticed in the literature that visual phrases of only 2 words give better performance [2, 38]. So, for each key-point visual phrase vpip of n words, we construct all possible combinations of 2 visual words. We also observed that, for BoVP approaches with a high number of phrases, the mere presence or absence of a visual phrase is discriminative enough. Thus, we decided to binarize the presence of visual phrases in an image. The final step of our compression methodology sums, for each word, its presence in distinct visual phrases.

The result of our proposal is a histogram of v bins as image signature, which is similar to the BoVW model. However, in our approach the histogram contains the frequency with which a word appears in the distinct visual phrases extracted at the key-points.


Algorithms 1 and 2 give the global approach with the binary based compression method. Some parts of those algorithms are written in more detail for readability but are obviously optimized in our actual code.
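As the algorithm listings are not reproduced here, the sketch below is our minimal reading of the binary based compression described above (2-word combinations, binarized phrase presence, then per-word counts); the handling of one-word phrases by counting the word once is an assumption.

```python
import numpy as np
from itertools import combinations

def compress_phrases(phrases, vocab_size):
    """Binary based compression sketch: visual phrases -> v-bin signature."""
    present = set()                              # binarized presence of 2-word phrases
    for phrase in phrases:
        words = sorted(set(phrase))
        if len(words) == 1:
            present.add((words[0],))             # assumed handling of one-word phrases
        else:
            present.update(combinations(words, 2))  # all 2-word combinations
    hist = np.zeros(vocab_size)
    for phrase in present:
        for w in set(phrase):
            hist[w] += 1                         # count the word once per distinct phrase
    return hist
```

For instance, compress_phrases(nbovw1_phrases(descriptors, vocabulary), len(vocabulary)) yields a v-bin histogram directly comparable in dimension to the BoVW signature.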

Of course, the information gathered from the methodologies described previously is different, even from the standard BoVW model. Thus, it is relevant to try to combine the histograms from the BoVW model and all n-BoVW methodologies. Out of the different solutions we have tried, concatenating the visual phrase histograms obtained after the binary based compression process has given the best results, even for a concatenation of visual features. However, finding and tuning the best mixing algorithm is not in the scope of this paper and could be part of our future work. Our experimental results will study the performance of each proposal separately (n-BoVW1, n-BoVW2, and n-BoVW3) and then combine them by concatenation (e.g. n-BoVW1.2 for the concatenation of the first two proposals).
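A minimal sketch of this combination step is given below (simple concatenation); the per-histogram L2 normalization before concatenation is our own assumption, not a detail taken from the description above.

```python
import numpy as np

def combine_signatures(histograms):
    """Concatenate the BoVW and n-BoVW histograms into a single signature."""
    normed = [h / (np.linalg.norm(h) + 1e-12) for h in histograms]  # assumed normalization
    return np.concatenate(normed)
```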

4 Experimental results

In this section, we present the experiments carried out to highlight the potential of our approach. The goal of a CBIR system is to retrieve the most similar images to a query image. To evaluate our different propositions, 3 well-known and commonly used low-level visual features, Speeded-up Robust Features (SURF), Color Moment Invariant (CMI) and Histogram of Oriented Gradients (HOG), and 4 state-of-the-art retrieval datasets were considered:

University of Kentucky Benchmark :

which has been proposed by Nistér et al. [25] and is referred to as UKB for readability. UKB contains 10200 images divided into 2550 groups; each group consists of 4 images of the same object under different conditions (rotated, blurred, ...). The score is the mean number of relevant images among the 4 nearest neighbors, averaged over all query images (maximum score of 4).

INRIA Holidays :

[16], referred to as Holidays, is a collection of 1491 images: 500 of them are query images, and the remaining 991 images are the corresponding relevant images. The evaluation on Holidays is based on the mean average precision score (mAP) [29].

Corel1000 or Wang :

[36], referred to as Wang, is a collection of 1000 images in 10 categories. The evaluation is the average precision over all images for the first 100 nearest neighbors.

Caltech101 :

[21], referred to as Caltech, is a collection of 101 categories containing between 40 and 800 images each. The evaluation is the average precision over all images for the first 100 nearest neighbors. Note that Caltech is more commonly used as a classification benchmark.

To construct the initial visual vocabulary, we used PASCAL VOC 2012 [9], which contains 17225 heterogeneous images categorized into 20 object classes. We use a visual vocabulary of 500 words for each descriptor. We did not select a bigger dataset for the vocabulary learning (which would probably yield better results) because we simulate the cultural heritage context of small expert datasets.

4.1 Performance of n-BoVW

First, we study the effect of the parameter n. Figure 6 shows the retrieval performance on the 4 datasets using the n-BoVW clustering proposal, with only 2 descriptors (CMI and SURF) selected for readability. It clearly indicates that n > 2 has very little interest (n = 3 is slightly better on the Caltech dataset). Similar observations have also been made with the HOG descriptor and with the other methodologies: for n > 2, the precision is sometimes stable and sometimes gradually drops. This result is consistent with the literature on visual phrases, where most approaches construct visual phrases of 2 words [2, 38]. Thus, we decided to focus our study on n = 2, even if n = 3 could give a small improvement for a specific dataset and descriptor.

Fig. 6 Study of the effect of the parameter n, the number of visual words per phrase

The next experiment evaluates the performance of the proposed methodologies. Table 1 presents the performance of the classical BoVW model and of our 3 proposed image representations. It shows that our 3 proposals outperform the BoVW model for all features and on almost all datasets. We observe that the HOG descriptor is not suited to the UKB dataset, which will affect further results. Caltech results are very stable, with no noteworthy improvement. With hindsight, we suppose this is due to the fact that Caltech is a classification benchmark, so the image representation has little effect on its results. Further experiments will show similar observations on this dataset.

Table 1 n-BoVW performance with one feature

Table 2 presents the results of our proposed approaches when the features are concatenated together. Note that CMI.SURF denotes the concatenation, at the end of the process, of the two final histograms for retrieval. Obviously, the effect of the concatenation is positive. We observe that concatenating all 3 visual features mostly improves the average performance. However, as observed previously, HOG degrades the performance of the framework on the UKB benchmark. We also notice some very interesting figures, highlighted in bold in the table. For example, a score of 3.48 (out of 4) on UKB using 2-BoVW1 with the concatenation of CMI.SURF is very high compared to BoVW (+ 37%). These two tables clearly highlight the interest of our proposals.

Table 2 n-BoVW performance concatenating features

Performance when combining methodologies

As the proposed methodologies show good performance with different image representations, we mix the obtained histograms together. The goal here is to evaluate the possible benefits of using a mixing strategy between the different image representations. As studying all possible mixing strategies is not in the scope of this paper, we select a simple one: concatenating the resulting histograms.

Table 3 shows the results obtained with the concatenation of the 3 selected features CMI.SURF.HOG, except for UKB where we only present the CMI.SURF performance, due to the poor performance of HOG observed on this dataset (2-BoVW1.2.3 drops to 3.22 with CMI.SURF.HOG). Besides the very interesting result obtained by mixing 2-BoVW1 and BoVW (3.5), we observe that concatenating the histograms tends to increase the performance of our approach. The dimension of the histogram is higher, and consequently more discriminative. If this simple strategy already improves the performance, more suitable strategies will surely give even higher performance.

Table 3 Performance combining by concatenation all proposals using CMI.SURF.HOG features

4.2 Discussion

The observed results show the interest of n-BoVW with the 3 methodologies we have proposed, combined with the BoVW model. The precision of the retrieval is clearly higher than with BoVW alone. Most literature approaches have indeed improved the BoVW model but needed some indexing structure to limit the loss in retrieval efficiency. Constructing the image signature with our framework is obviously more complex than with the BoVW model: for one image, 4 histograms are created and combined. The most complex methodology is the second one, n-BoVW2, because it needs to sort all image key-points to pick the n nearest ones. Constructing the image signature using our model is about 5 times more costly than using the BoVW model. However, we remind the reader that indexing an image dataset is an offline process. Having a more discriminative image representation is the preponderant issue. At the end of the indexing stage, each image is represented by a signature of the same dimension (i.e. the vocabulary size) as the BoVW one. Thus, the increase in complexity has little effect on the global retrieval process.

Only two steps matter in the retrieval process: (i) computing the similarity measure between the query image and all other images of the dataset and (ii) sorting the similarity values to obtain the nearest neighbors of the query image. These two steps mainly depend on the number of images in the dataset I, the similarity measure used (dL2 in our case), and the signature dimension dim. Thus, the theoretical cost of the retrieval is O(I · dim) for the distance computations plus O(I log I) for the sorting. Our experimental analysis of the retrieval efficiency gives the following average query times on a single-core 2.4 GHz processor: 500 milliseconds for Wang, 700 ms for Holidays and 1.4 s for UKB. As the dimension of the signature is the same for each of our proposed models as for the BoVW model, we obtain approximately similar retrieval times for all of them. When we concatenate with the AlexNet histogram or combine 2 of our proposals, the dimension of the signature doubles and the retrieval time increases slightly (approx. 1 s for Wang and Holidays and 2 s for UKB). Note that an approximate nearest neighbor approach would improve those retrieval times, but it is not in the scope of this paper.
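For reference, a brute-force retrieval sketch matching this analysis (a hypothetical helper, not the original code):

```python
import numpy as np

def retrieve(query_sig, database_sigs, k=4):
    """Exhaustive retrieval: L2 distance to every signature, then sort."""
    dists = np.linalg.norm(database_sigs - query_sig[None, :], axis=1)  # O(I * dim)
    return np.argsort(dists)[:k]                 # indices of the k nearest images
```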

Another point of discussion we highlight is that describing a key-point with more than one word could add noise to the image description. Thus, we have tried to put a distance threshold in our algorithm: visual phrases constructed with words that are too “far” are not taken into consideration, and the visual phrase is replaced by a single word. Figure 7 presents the results observed on 3 datasets with respect to the percentage of visual phrases used, for the CMI descriptor. Note that the other features behave similarly. The results are a bit surprising, because the best results are achieved with a percentage of visual phrases close to 100%. Thus, we may conclude that all visual phrases are needed in n-BoVW, even if results still improve when combining with BoVW.

Fig. 7 Performance of n-BoVW with respect to the percentage of visual phrases used for CMI
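A minimal sketch of the distance-threshold variant discussed above, assuming phrases produced together with their word-to-key-point distances (e.g. from n-BoVW1, where words are sorted by distance); the threshold tau and the fallback to the single nearest word are illustrative assumptions.

```python
def threshold_phrases(phrases, distances, tau):
    """Drop phrase words whose distance to the key-point exceeds tau."""
    kept = []
    for words, dists in zip(phrases, distances):
        close = [w for w, d in zip(words, dists) if d <= tau]
        # If every word is too far, fall back to the single nearest word
        kept.append(tuple(close) if close else (words[0],))
    return kept
```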

Finally, we compare our approach against a few state-of-the-art methods in Table 4. We report the results given by the authors when available (na when not available) or computed by ourselves. Note that Caltech results are not presented, as Caltech is more of a classification benchmark and most literature results use a learning strategy. We divide Table 4 into three parts. First, we show well-known state-of-the-art approaches that construct efficient and effective image signatures, as our approach does. For those signature-based approaches, we observe that our proposed approach mostly outperforms all other methods. These results highlight the discriminative performance of our methodology.

Table 4 n-BoVW vs other methods

The second part of Table 4 compares our method to well-known transfer learning based approaches, i.e. AlexNet [19] and R-CNN [10]. The performance of our approach using only the signature is not too far from those CNN based approaches. However, as this comparison seemed unfair in our view (due to the huge learning phase of the CNN approaches), we combined our signature approach with AlexNet [19]. This concatenation of features from our approach and a transfer learning based approach greatly improved the results of both on all three datasets and outperforms the transfer learning based approaches.

The last part of Table 4 presents results from two well-known deep learning approaches. We present the results of those supervised methods to show that our unsupervised approach is not far from reaching a performance similar to that of deep learning approaches. It shows the discriminative power of our proposed image representation.

We also remind the reader that the motivation of this paper is to create a better image representation for cultural heritage datasets with no prior knowledge and without tuning our approach or adding pre- or post-processing methodologies. As an example of possible improvements, we conducted a first experiment with an automatic query expansion methodology [6], where all nearest neighbors of the nearest neighbors are used for query expansion, and we obtained a UKB score between 3.65 and 3.7, which is above all presented results. Another possibility is to use a CNN feature as a local feature in our framework. However, it might not be adapted to this specific cultural heritage context; our future work will further study those strategies.

5 Conclusion

This paper presents a more discriminative BoVW framework called n-BoVW. Different methodologies based on the visual phrases model were proposed, all of which outperform the BoVW model and other recent approaches on all test datasets. Mixing our proposed model with a transfer learning CNN model greatly improves the performance of both models. Another contribution of this paper is the proposed binary based compression method. It allows the proposed framework to have a computational cost similar to that of the BoVW model for retrieval. Extensive experiments also show the interest of our approach, even compared to deep learning approaches. Our perspectives will focus first on the notion of distance from a key-point to a visual word discussed in the previous section. We believe it could be useful to automatically adapt the parameter n for each key-point. Thus, visual phrases of different lengths could represent each key-point of the image. Our future work will also focus on active learning strategies that can be applied to our framework in our specific applicative context of small cultural heritage datasets, in collaboration with experts from the field.