1 Introduction

Content Based Image Retrieval (CBIR) has been an active field for decades, and remains active as the amount of available data increases exponentially. This phenomenon has highlighted the need for efficient and effective tools to manage such volumes of data. One important topic in the CBIR field is the construction of the image signature. Indeed, the image signature is at the core of any CBIR system. An accurate and discriminative signature will improve the precision of the retrieval process. It will also help to bridge the well-known semantic gap between low-level features and the semantic concepts a user perceives in the image. Figure 1 illustrates the overall flowchart of CBIR systems. The research work proposed in this paper focuses on the image signature construction step.

Fig. 1 General CBIR system

In this paper, we focus on “expert image datasets”, which are of interest for domain expert end users. These expert end-users can be clinicians, historians, digital curators, numismatists, etc. Such expert datasets may have quite heterogeneous contents (e.g. historic images of persons, constructions, etc.) or more specific contents (e.g. datasets of old coins, butterflies, etc.). The particularities of such datasets have to be considered in the indexing tools that will help experts manage their data for further image exploitation stages such as retrieval or browsing. Moreover, improving the image representation will benefit other application domains, such as action or activity recognition in computer vision [22, 23].

Our study, conducted within the framework of a cultural heritage project, aims at indexing expert image datasets of limited size (around 10000 images). These datasets are very specific and dedicated to the cultural heritage domain, with very little or no prior knowledge on each image.

Among the numerous state-of-the-art approaches that have tried to “narrow down” the semantic gap, some have improved the descriptive power of visual features [35], gradually refining the existing local and global image descriptors, while others have proposed effective ways to use, mix and optimize these features. Among the latter, the bag of visual words model (BoVW) [8, 32] has become a reference in CBIR. The BoVW model represents images as histograms of visual words, enhancing retrieval efficiency without losing much accuracy.

More recently, some researchers have argued that the discriminative power of BoVW was not sufficient and have proposed to construct visual phrases or bags of bags of words by structuring visual words together in different ways. However, bag of visual phrases models [27, 30, 38] are computationally expensive.

Many new learning algorithms and architectures are currently being developed using deep neural networks in order to further enhance their performance and/or effectiveness. Since the AlexNet architecture that achieved the ImageNet 2012 milestone [19], developments for object classification include very deep networks such as VGG16/VGG19 [31] and MSRANET [13], wider deep networks, e.g. obtained by adding Inception modules to the architecture [15, 34], and wider and deeper, yet efficient, networks such as ResNet [12], Inception V4 + ResNet [33] and Identity ResNet [14]. When semantic information is available, a wide spectrum of cross-modal deep learning approaches exists in the literature [37].

Because of their very large number of parameters to adjust, learning-based methodologies, including deep learning models, generally require massive amounts of annotated data to be trained, which might be very costly and even almost impossible to obtain in some applications. As previously mentioned, our objective is to index small, dedicated expert datasets of cultural heritage with no prior knowledge on the data. Thus, we concentrate our study on improving the image representation without using learning algorithms, which are inappropriate in our case.

Thus, this paper presents our n-Bag of visual words framework, denoted n-BoVW, which increases the discriminative power of the BoVW model. Our framework is unsupervised and needs no semantic annotations to create the image representation. It is an extension of the framework we initially proposed in [26], with a new image representation proposal and a complete study of the framework through extensive experiments. n-BoVW uses the idea of visual phrases by selecting multiple visual words to represent each key-point, but keeps the efficiency of the BoVW model thanks to a compression algorithm. Three methodologies are proposed and combined with the BoVW model to obtain our final image representation. Our experimental results on different “generic” benchmark datasets highlight the potential of our proposal.

The remainder of this article is structured as follows: we provide a brief overview of related work on bags of visual words and phrases in Section 2. Then, we explain our different proposals in Section 3. We present the experiments on 4 different datasets and discuss the findings of our study in Section 4. Section 5 concludes and gives some perspectives to our work.

2 State of the art

We present in this section a brief overview of unsupervised approaches in the CBIR literature that are related to the BoVW model proposed by Csurka et al. [8]. Its inspiration comes from the Bag of Words model [11] of the Information Retrieval domain. The BoVW retrieval framework contains four main steps. For all images, feature detection and extraction are performed. These two steps detect a list of key-points with rich visual and local information and convert this information into vectors. Many visual descriptors have been created; among them, the Scale Invariant Feature Transform (SIFT) [24] and Speeded-up Robust Features (SURF) [5] became two of the most popular. A recent comparative study details the more recent visual descriptors [7]. Then, an off-line process extracts the visual vocabulary, a set of visual words, by applying a clustering algorithm to the set of visual features. Finally, each key-point of each image is assigned to the closest visual word of the vocabulary. Thus, each image is represented by a histogram of visual word frequencies, i.e. the image signature.
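To make this pipeline concrete, the minimal sketch below illustrates the two core steps (offline vocabulary construction and histogram assignment) in Python with NumPy and scikit-learn. The function names, the use of k-means and the L1 normalization are our own illustrative assumptions; descriptor extraction (e.g. SURF) is assumed to be done elsewhere.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=500, seed=0):
    """Offline step: cluster descriptors from a separate dataset into visual words."""
    kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=seed)
    kmeans.fit(all_descriptors)                 # all_descriptors: (N, dim) array
    return kmeans.cluster_centers_              # vocabulary: (n_words, dim) array

def bovw_signature(descriptors, vocabulary):
    """Online step: assign each key-point to its closest word and count frequencies."""
    # Euclidean distances between each descriptor and each visual word
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)                # index of the closest word per key-point
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)          # L1-normalized word-frequency histogram
```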

Inspired by the BoVW model, the Fisher Kernel [28] and the Vector of Locally Aggregated Descriptors (VLAD) [17] have met with great success. The first approach, proposed by Perronnin and Dance [28], applies Fisher Kernels to visual vocabularies represented by means of a Gaussian Mixture Model (GMM). VLAD was introduced by Jégou et al. [17] and can be seen as a simplification of the Fisher kernel. The idea of VLAD is to assign each key-point to its closest visual word and to accumulate, for each visual word, the differences between the word and the descriptors assigned to it.
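As a minimal illustration of the VLAD idea (a sketch, not the original implementation), the following code accumulates, for each visual word, the residuals of the descriptors assigned to it; the final L2 normalization is a common but assumed choice.

```python
import numpy as np

def vlad_signature(descriptors, vocabulary):
    """Minimal VLAD sketch: sum of residuals (descriptor - closest word) per word."""
    v, dim = vocabulary.shape
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)              # hard assignment to the closest word
    agg = np.zeros((v, dim))
    for w in range(v):
        assigned = descriptors[nearest == w]
        if len(assigned) > 0:
            agg[w] = (assigned - vocabulary[w]).sum(axis=0)
    flat = agg.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)  # L2-normalized VLAD vector
```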

Recently, some researchers have focused on improving the discriminative power of the BoVW model. To this end, they have proposed to construct visual phrases or groups/bags of bags of words. Among them, we can cite the work of Yang et al. [38] or Alqasrawi et al. [2], who have used the spatial pyramid representation [20] to construct visual phrases from words that are spatially close or co-occur in the same sub-region. They obtained good results for classification purposes. Ren et al. [30] have extended the BoVW model into Bags of Bags of Visual Words. They have proposed an Irregular Pyramid Matching with the Normalized Cut methodology to subdivide the image into a connected graph. Jiang et al. [18] have tried to overcome the problem of visual phrase burstiness, i.e. the repetition of visual phrases influencing the image similarity measure. They have proposed a framework combining geometry constraints and probabilistic filtering. Other researchers, such as Yeganli et al. [39], have chosen to mix several vocabularies with different image resolutions.

Most of those methodologies, by considering more meaningful combinations of words, reach a better effectiveness than the original BoVW model. However, this improved performance comes at the cost of efficiency, as the processes for extracting/matching word combinations are generally quite costly. Our proposal aims to avoid this efficiency loss while keeping high performance. Moreover, compared to the state-of-the-art bag of visual phrases approaches, we propose to improve the description of each key-point with its neighborhood, whereas BoVP approaches link key-points together. We also remind the reader that our proposal is unsupervised, which makes it useful for expert datasets in different applicative contexts.

3 Approach

In this section, we present our global framework before detailing our contributions. Motivated by our specific context of small expert datasets without prior knowledge on the images, our main objective is to improve the image representation. Our unsupervised approach increases the discriminative power of the image signature without losing much efficiency in the retrieval process. Our goal is to propose an alternative to supervised approaches, including deep learning approaches, with close performance.

Thus, we use a CBIR framework without any image filtering, refinement process or data augmentation methodology applied to the visual features used or to the constructed vocabularies. This ensures the reproducibility of our results.

Figure 2 presents the different steps of our global framework. In the top part of Fig. 2, we find the detection and extraction steps for each image of the dataset. Then, using the previously constructed visual vocabulary, we use four different ways to construct the image signature. The first “column” is the classical BoVW model, which gives for each image a histogram of visual word frequencies as signature (which will be binarized to be combined). As in most CBIR systems using the BoVW model, a standard off-line learning process constructs the visual vocabulary on a separate dataset.

Fig. 2 Global framework

The other “columns” correspond to our first contribution, an approach we denote n-Bag of Visual Words (n-BoVW). n-BoVW selects n visual words from the vocabulary to represent each detected key-point by a visual phrase. Our motivation in assigning n words to each selected key-point is to increase the discriminative power of the BoVW model. Three different methodologies have been studied: (i) selecting the n closest visual words to a key-point (second “column” in the image), (ii) clustering the n nearest key-points together in the visual feature space to obtain a list of n visual words, one word per key-point inside the small cluster, and (iii) taking a list of n visual words from the cluster of the visual word nearest to the key-point. Compared to state-of-the-art bag of visual phrases approaches that group key-points of a similar region or object (e.g. [18, 30]), our proposals give multiple words to each key-point, improving its discriminative power.

Indexing images with a list of visual words per key-point, i.e. a visual phrase, is more complex than assigning a single word per key-point, particularly for the retrieval process, which needs to compute the similarity between sets of visual phrases instead of between word frequency histograms. Thus, for the three proposals, our second contribution is a compression process that ensures an efficient retrieval. During this process, the list of visual phrases is reduced to a simple histogram of frequencies similar to the BoVW histogram. A final combining step is also proposed to construct the final signature of the image. This step mixes the obtained histograms to improve the discriminative power of the image signature. The following subsections detail these proposals. Note that the retrieval step is common to most CBIR systems: we compute the nearest neighbors of a query image using a similarity metric between the image signatures.

3.1 n-bag of visual words methodologies

As mentioned in Section 2, visual phrases group visual words together to be more discriminative. Our first contribution presents three different methodologies to better describe or represent each key-point by n visual words. Note that visual phrase models from the literature usually take n words from different key-points with the objective of better representing the surrounding sub-region. Our approach differs, as we aim at providing a more precise description of each key-point using a small vocabulary size.

The first methodology we propose, referred to as n-BoVW1, selects n visual words from the same visual vocabulary to represent each key-point of an image. Let W denote the vocabulary of v visual words vw1,...,vwv constructed using an offline process on a separate dataset. Let KPi be the set of key-points extracted by the detection step for image i, with kpip the p-th key-point of image i. For each kpip, we compute the Euclidean distance (dL2) between the key-point and each visual word vwj from the vocabulary W:

$$ dL2(kp_{ip}, vw_{j}) = \sqrt{\sum\limits_{d = 1}^{dim}(f_{ip_{d}} - vw_{j_{d}})^{2}}, $$
(1)

where \(f_{ip_{d}}\) is the d-th value of the extracted visual feature f of dimension dim for kpip.

Then, W is sorted according to these distances in order to pick the n nearest visual words to kpip. Thus, for each key-point kpip, we obtain a visual phrase vp1ip, i.e. a set of n distinct visual words \({vw1_{1_{ip}}, ..., vw1_{n_{ip}}}\). An example is given in Fig. 3 with n = 2 for better understanding. We can see that the two nearest visual words of kp2 are vw1 and vw6, thus kp2 is represented by the visual phrase (vw1, vw6); similarly, kp8 and kp5 are represented by (vw5, vw3) and (vw1, vw7) respectively.

Fig. 3 Example for n-BoVW1
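A minimal sketch of this first methodology is given below, under the same array conventions as before (descriptors of shape (|KPi|, dim), vocabulary of shape (v, dim)); the function name is ours.

```python
import numpy as np

def nbovw1_phrases(descriptors, vocabulary, n=2):
    """n-BoVW1 sketch: each key-point keeps its n closest visual words (Eq. 1)."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    # For every key-point, the indices of the n nearest words form its visual phrase
    return [tuple(np.argsort(row)[:n]) for row in dists]
```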

With this first methodology, we ensure the description of a key-point by a visual phrase of n distinct words. However, it never takes into account the possibility that the key-point could be better represented by a single word.

Our second proposal, referred to as n-BoVW2, addresses this limitation, as well as the fact that a key-point description can be somewhat noisy, which makes it interesting to look at its surroundings directly in the descriptor (or visual feature) space. Much as a clustering algorithm would, we gather the nearest key-points in the visual feature space to form a strong choice of visual words. Each of those selected key-points is then linked to only one visual word. Note that the probability that nearby visual features are represented by similar visual words is high. So, it allows the possibility of having only one representative visual word for the small cluster of key-points.

For each key-point kpip, we compute the Euclidean distances to KPi, i.e. the other key-points of image i. These distances are then sorted in order to retrieve the n nearest key-points in the visual feature space (including the current key-point itself). This set of nearest key-points NKPip is then used to select the representative visual words. For each key-point of NKPip, its nearest visual word is computed using the L2-distance. In the end, for each key-point kpip we also obtain a visual phrase vp2ip, i.e. a set of n visual words \({vw2_{1_{ip}}, ..., vw2_{n_{ip}}}\), with a high probability of duplicates. An example is given in Fig. 4. We can see that the nearest key-point of kp2 is kp5; both key-points are represented by vw1, so we have only vw1 to represent kp2, which differs from the first method. However, kp8, linked to its nearest neighbor kp7, is still represented by the same visual phrase (vw2, vw3), and kp5 (linked to kp6) is represented by (vw1, vw7).

Fig. 4 Example for n-BoVW2
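The sketch below follows the same conventions; it is our illustrative reading of n-BoVW2, where each key-point collects the closest word of each of its n nearest key-points (itself included), so duplicates may occur.

```python
import numpy as np

def nbovw2_phrases(descriptors, vocabulary, n=2):
    """n-BoVW2 sketch: words of the n nearest key-points in descriptor space."""
    d_kp = np.linalg.norm(descriptors[:, None, :] - descriptors[None, :, :], axis=2)
    d_vw = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    closest_word = d_vw.argmin(axis=1)          # one visual word per key-point
    phrases = []
    for p in range(len(descriptors)):
        neighbors = np.argsort(d_kp[p])[:n]     # n nearest key-points (self has distance 0)
        phrases.append(tuple(closest_word[q] for q in neighbors))  # duplicates allowed
    return phrases
```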

Our last proposal, referred to as n-BoVW3, links a key-point to a set of visual words that belong to the same cluster. For this part, we cluster the vocabulary into small groups of n visual words. Such a set of visual words represents a larger region of the vocabulary space than a single word. The first step is to create the visual phrase vocabulary VP using a clustering algorithm. To do so, we tried both a hierarchical cluster analysis (HCA with complete linkage) and a space-partitioning data structure, the Kd-tree [1]. Note that our framework mostly needs small clusters, so we select half of the vocabulary size as the number of clusters, which is a good starting point. Tuning this parameter is out of the scope of this paper. Of course, some clusters may contain only one word, while others contain two, three or more words. The next step is to assign a visual phrase from VP to each key-point by selecting the nearest visual word and its associated neighbors.

Figure 5 gives an example of this proposal. Both key-points kp2 and kp5 are linked to the visual word vw1. vw1 and vw7 belong to the same cluster C1. Thus, the visual phrase (vw1, vw7) represents those two key-points. If we look at kp8, the nearest word is vw3 (as in the BoVW model), which is in the same cluster C4 as vw5 and vw2. Thus, all possible visual phrases ((vw3, vw5), (vw5, vw2), (vw3, vw2)) using those three visual words will be added to the image representation. This pedagogical example shows that this proposal also differs from BoVW and from the first two approaches (n-BoVW1 and n-BoVW2). This difference is amplified with real data.

Fig. 5 Example for n-BoVW3
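The following sketch illustrates n-BoVW3 with the HCA (complete linkage) variant and v/2 clusters, as described above; treating a singleton cluster as a one-word phrase and expanding larger clusters into all 2-word pairs are our assumptions for the n = 2 case.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import AgglomerativeClustering

def nbovw3_phrases(descriptors, vocabulary):
    """n-BoVW3 sketch: phrases taken from the cluster of the nearest visual word."""
    v = len(vocabulary)
    labels = AgglomerativeClustering(n_clusters=v // 2,
                                     linkage="complete").fit_predict(vocabulary)
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)              # closest word per key-point, as in BoVW
    phrases = []
    for w in nearest:
        members = np.flatnonzero(labels == labels[w])
        if len(members) == 1:                   # singleton cluster: a one-word "phrase"
            phrases.append((int(members[0]),))
        else:                                   # all 2-word phrases from the cluster
            phrases.extend(combinations(members.tolist(), 2))
    return phrases
```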

3.2 Compression algorithm

The main disadvantage of obtaining visual phrases from the proposed methodologies is the number of possible phrases, i.e. \(\binom{v}{n} = \frac{v!}{(v-n)!\,n!}\), which is computationally too high for a retrieval system. To deal with this phenomenon, we propose a compression algorithm that applies to all of our proposals.

We first noticed in the literature that visual phrases of only 2 words give better performance [2, 38]. So, for each key-point visual phrase vpip of n words, we construct all possible combinations of 2 visual words. We also observed that, for BoVP approaches with a high number of phrases, the mere presence or absence of a visual phrase is discriminative enough. Thus, we decided to binarize the presence of visual phrases in an image. The final step of our compression methodology sums, for each word, its presence in distinct visual phrases.

The result of our proposal is a histogram of v bins as image signature, which is similar to the BoVW model. However, in our approach the histogram contains the frequency with which a word appears in the distinct visual phrases extracted at the key-points.


Algorithms 1 and 2 give the global approach with the binary based compression method. Some parts of those algorithms are written in more detail for readability but are obviously optimized in our actual code.
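As the algorithm listings are not reproduced here, the sketch below is our minimal reading of the binary based compression described above (2-word combinations, binarized phrase presence, then per-word counts); the handling of one-word phrases by counting the word once is an assumption.

```python
import numpy as np
from itertools import combinations

def compress_phrases(phrases, vocab_size):
    """Binary based compression sketch: visual phrases -> v-bin signature."""
    present = set()                              # binarized presence of 2-word phrases
    for phrase in phrases:
        words = sorted(set(phrase))
        if len(words) == 1:
            present.add((words[0],))             # assumed handling of one-word phrases
        else:
            present.update(combinations(words, 2))  # all 2-word combinations
    hist = np.zeros(vocab_size)
    for phrase in present:
        for w in set(phrase):
            hist[w] += 1                         # count the word once per distinct phrase
    return hist
```

For instance, compress_phrases(nbovw1_phrases(descriptors, vocabulary), len(vocabulary)) yields a v-bin histogram directly comparable in dimension to the BoVW signature.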

Of course, the information gathered from the methodologies described previously is different, even from the standard BoVW model. Thus, it is relevant to try to combine the histograms from the BoVW model and all n-BoVW methodologies. Out of the different solutions we have tried, concatenating the visual phrase histograms obtained after the binary based compression process has given the best results, even for a concatenation of visual features. However, finding and tuning the best mixing algorithm is not in the scope of this paper and could be part of our future work. Our experimental results will study the performance of each proposal separately (n-BoVW1, n-BoVW2, and n-BoVW3) and then combine them by concatenation (e.g. n-BoVW1.2 for the concatenation of the first two proposals).
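A minimal sketch of this combination step is given below (simple concatenation); the per-histogram L2 normalization before concatenation is our own assumption, not a detail taken from the description above.

```python
import numpy as np

def combine_signatures(histograms):
    """Concatenate the BoVW and n-BoVW histograms into a single signature."""
    normed = [h / (np.linalg.norm(h) + 1e-12) for h in histograms]  # assumed normalization
    return np.concatenate(normed)
```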

4 Experimental results

In this section, we present the experiments carried out to highlight the potential of our approach. The goal of a CBIR system is to retrieve the most similar images to a query image. To evaluate our different propositions, 3 well-known and commonly used low-level visual features, Speeded-up Robust Features (SURF), Color Moment Invariant (CMI) and Histogram of Oriented Gradients (HOG), and 4 state-of-the-art retrieval datasets were considered:

University of Kentucky Benchmark :

which has been proposed by Nistér et al. [25] and is referred to as UKB for readability. UKB contains 10200 images divided into 2550 groups; each group consists of 4 images of the same object under different conditions (rotated, blurred, ...). The score is the mean number of relevant images among the 4 nearest neighbors, averaged over all query images (maximum score of 4).

INRIA Holidays :

[16], referred to as Holidays, is a collection of 1491 images: 500 of them are query images, and the remaining 991 images are the corresponding relevant images. The evaluation on Holidays is based on the mean average precision score (mAP) [29].

Corel1000 or Wang :

[36], referred to as Wang, is a collection of 1000 images in 10 categories. The evaluation is the average precision over all images for the first 100 nearest neighbors.

Caltech101 :

[21], referred to as Caltech, is a collection of 101 categories containing between 40 and 800 images each. The evaluation is the average precision over all images for the first 100 nearest neighbors. Note that Caltech is more commonly used as a classification benchmark.

To construct the initial visual vocabulary, we used PASCAL VOC 2012 [9], which contains 17225 heterogeneous images categorized into 20 object classes. We use a visual vocabulary of 500 words for each descriptor. We did not select a bigger dataset for the vocabulary learning (which would probably yield better results) because we simulate the cultural heritage context of small expert datasets.

4.1 Performance of n-BoVW

First, we study the effect of the parameter n. Figure 6 shows the retrieval performance on the 4 datasets using the n-BoVW clustering proposal, with only 2 descriptors (CMI and SURF) selected for readability. It clearly indicates that n > 2 has very little interest (n = 3 is slightly better on the Caltech dataset). Similar observations have also been made with the HOG descriptor and with the other methodologies: for n > 2, the precision is sometimes stable and sometimes gradually drops. This result is consistent with the literature on visual phrases, where most approaches construct visual phrases of 2 words [2, 38]. Thus, we decided to focus our study on n = 2, even if n = 3 could give a small improvement for a specific dataset and descriptor.

Fig. 6 Study of the effect of the parameter n, the number of visual words per phrase

The next experiment evaluates the performance of the proposed methodologies. Table 1 presents the performance of the classical BoVW model and of our 3 proposed image representations. It shows that our 3 proposals outperform the BoVW model for all features and on almost all datasets. We observe that the HOG descriptor is not suited to the UKB dataset, which will affect further results. Caltech results are very stable, with no noteworthy improvement. With hindsight, we suppose this is due to the fact that Caltech is a classification benchmark, so the image representation has little effect on its results. Further experiments will show similar observations on this dataset.

Table 1 n-BoVW performance with one feature

Table 2 presents the results of our proposed approaches when the features are concatenated together. Note that CMI.SURF denotes the concatenation, at the end of the process, of the two final histograms for retrieval. Obviously, the effect of the concatenation is positive. We observe that concatenating all 3 visual features mostly improves the average performance. However, as observed previously, HOG degrades the performance of the framework on the UKB benchmark. We also notice some very interesting figures, highlighted in bold in the table. For example, a score of 3.48 (out of 4) on UKB using 2-BoVW1 with the concatenation of CMI.SURF is very high compared to BoVW (+ 37%). These two tables clearly highlight the interest of our proposals.

Table 2 n-BoVW performance concatenating features

Performance when combining methodologies

As the proposed methodologies show good performance with different image representations, we mix the obtained histograms together. The goal here is to evaluate the possible benefits of using a mixing strategy between the different image representations. As studying all possible mixing strategies is not in the scope of this paper, we select a simple one: concatenating the resulting histograms.

Table 3 shows the results obtained with the concatenation of the 3 selected features CMI.SURF.HOG, except for UKB where we only present the CMI.SURF performance, due to the poor performance of HOG observed on this dataset (2-BoVW1.2.3 drops to 3.22 with CMI.SURF.HOG). Besides the very interesting result obtained by mixing 2-BoVW1 and BoVW (3.5), we observe that concatenating the histograms tends to increase the performance of our approach. The dimension of the histogram is higher, and consequently more discriminative. If this simple strategy already improves the performance, more suitable strategies will surely give even higher performance.

Table 3 Performance combining by concatenation all proposals using CMI.SURF.HOG features

4.2 Discussion

The observed results show the interest of n-BoVW with the 3 methodologies we have proposed, combined with the BoVW model. The precision of the retrieval is clearly higher than with BoVW alone. Most literature approaches have indeed improved the BoVW model but needed some indexing structure to limit the loss in retrieval efficiency. Constructing the image signature with our framework is obviously more complex than with the BoVW model: for one image, 4 histograms are created and combined. The most complex methodology is the second one, n-BoVW2, because it needs to sort all image key-points to pick the n nearest ones. Constructing the image signature using our model is about 5 times more costly than using the BoVW model. However, we remind the reader that indexing an image dataset is an offline process. Having a more discriminative image representation is the preponderant issue. At the end of the indexing stage, each image is represented by a signature of the same dimension (i.e. the vocabulary size) as the BoVW one. Thus, the increase in complexity has little effect on the global retrieval process.

Only two steps matter in the retrieval process: (i) computing the similarity measure between the query image and all other images of the dataset and (ii) sorting the similarity values to obtain the nearest neighbors of the query image. These two steps mainly depend on the number of images in the dataset I, the similarity measure used (dL2 in our case), and the signature dimension dim. Thus, the theoretical cost of the retrieval is O(I · dim) for the distance computations plus O(I log I) for the sorting. Our experimental analysis of the retrieval efficiency gives the following average query times on a single-core 2.4 GHz processor: 500 milliseconds for Wang, 700 ms for Holidays and 1.4 s for UKB. As the dimension of the signature is the same for each of our proposed models as for the BoVW model, we obtain approximately similar retrieval times for all of them. When we concatenate with the AlexNet histogram or combine 2 of our proposals, the dimension of the signature doubles and the retrieval time increases slightly (approx. 1 s for Wang and Holidays and 2 s for UKB). Note that an approximate nearest neighbor approach would improve those retrieval times, but it is not in the scope of this paper.
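For reference, a brute-force retrieval sketch matching this analysis (a hypothetical helper, not the original code):

```python
import numpy as np

def retrieve(query_sig, database_sigs, k=4):
    """Exhaustive retrieval: L2 distance to every signature, then sort."""
    dists = np.linalg.norm(database_sigs - query_sig[None, :], axis=1)  # O(I * dim)
    return np.argsort(dists)[:k]                 # indices of the k nearest images
```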

Another point of discussion we highlight is that describing a key-point with more than one word could add noise to the image description. Thus, we have tried to put a distance threshold in our algorithm: visual phrases constructed with words that are too “far” are not taken into consideration, and the visual phrase is replaced by a single word. Figure 7 presents the results observed on 3 datasets with respect to the percentage of visual phrases used, for the CMI descriptor. Note that the other features behave similarly. The results are a bit surprising, because the best results are achieved with a percentage of visual phrases close to 100%. Thus, we may conclude that all visual phrases are needed in n-BoVW, even if results still improve when combining with BoVW.

Fig. 7 Performance of n-BoVW with respect to the percentage of visual phrases used for CMI
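A minimal sketch of the distance-threshold variant discussed above, assuming phrases produced together with their word-to-key-point distances (e.g. from n-BoVW1, where words are sorted by distance); the threshold tau and the fallback to the single nearest word are illustrative assumptions.

```python
def threshold_phrases(phrases, distances, tau):
    """Drop phrase words whose distance to the key-point exceeds tau."""
    kept = []
    for words, dists in zip(phrases, distances):
        close = [w for w, d in zip(words, dists) if d <= tau]
        # If every word is too far, fall back to the single nearest word
        kept.append(tuple(close) if close else (words[0],))
    return kept
```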

Finally, we compare our approach against a few state-of-the-art methods in Table 4. We report the results given by the authors when available (na when not available) or computed by ourselves. Note that Caltech results are not presented, as Caltech is more of a classification benchmark and most literature results use a learning strategy. We divide Table 4 into three parts. First, we show well-known state-of-the-art approaches that construct efficient and effective image signatures, as our approach does. For those signature-based approaches, we observe that our proposed approach mostly outperforms all other methods. These results highlight the discriminative performance of our methodology.

Table 4 n-BoVW vs other methods

The second part of Table 4 compares our method to well-known transfer learning based approaches, i.e. AlexNet [19] and R-CNN [10]. The performance of our approach using only the signature is not too far from those CNN based approaches. However, as this comparison seemed unfair in our view (due to the huge learning phase of the CNN approaches), we combined our signature approach with AlexNet [19]. This concatenation of features from our approach and a transfer learning based approach greatly improved the results of both on all three datasets and outperforms the transfer learning based approaches.

The last part of Table 4 presents results from two well-known deep learning approaches. We present the results of those supervised methods to show that our unsupervised approach is not far from reaching a performance similar to that of deep learning approaches. It shows the discriminative power of our proposed image representation.

We also remind the reader that the motivation of this paper is to create a better image representation for cultural heritage datasets with no prior knowledge and without tuning our approach or adding pre- or post-processing methodologies. As an example of possible improvements, we conducted a first experiment with an automatic query expansion methodology [6], where all nearest neighbors of the nearest neighbors are used for query expansion, and we obtained a UKB score between 3.65 and 3.7, which is above all presented results. Another possibility is to use a CNN feature as a local feature in our framework. However, it might not be adapted to this specific cultural heritage context; our future work will further study those strategies.

5 Conclusion

This paper presents a more discriminative BoVW framework called n-BoVW. Different methodologies based on the visual phrases model were proposed, all of which outperform the BoVW model and other recent approaches on all test datasets. Mixing our proposed model with a transfer learning CNN model greatly improves the performance of both models. Another contribution of this paper is the proposed binary based compression method. It allows the proposed framework to have a computational cost similar to that of the BoVW model for retrieval. Extensive experiments also show the interest of our approach, even compared to deep learning approaches. Our perspectives will focus first on the notion of distance from a key-point to a visual word discussed in the previous section. We believe it could be useful to automatically adapt the parameter n for each key-point. Thus, visual phrases of different lengths could represent each key-point of the image. Our future work will also focus on active learning strategies that can be applied to our framework in our specific applicative context of small cultural heritage datasets, in collaboration with experts from the field.