1 Introduction

Searching for images is a daily task for many medical professionals, especially in image-oriented fields such as radiology (Markonis et al. 2012). However, the huge amount of visual data in hospitals and the medical literature is not always easily accessible, and physicians generally have little time for information search, as they need to diagnose an increasing number of cases, with increasing image detail, in a limited amount of time.

Therefore, medical image retrieval systems need to return information adjusted to the knowledge level and expertise of the user in a quick and precise fashion. A well-known technique for improving search results through user interaction is relevance feedback (Rocchio 1971). Relevance feedback allows the user to mark results returned in a previous search step as relevant or irrelevant in order to refine the initial query. The concept behind relevance feedback is that although users may have difficulty formulating a precise query for a specific task, they can generally tell quickly whether a returned result is relevant to their information need. This technique found use in image retrieval particularly with the emergence of content-based image retrieval (CBIR) systems (Squire et al. 2000; Taycher et al. 1997; Wood et al. 1998). Following the CBIR approach, the visual content of the marked results is used to refine the initial image query. With the result images represented as a grid of thumbnails with limited metadata, relevance feedback can be applied quickly to speed up the search iterations and refine the results. Recent user tests with radiologists on a medical image search system also showed that this method is intuitive and straightforward to learn (Markonis et al. 2013).

Depending on whether the user provides the feedback to the system manually (e.g. by marking results) or the system obtains this information automatically (e.g. by log analysis), relevance feedback can be categorized as explicit or implicit. Moreover, the information obtained through relevance feedback can be used to affect the general behaviour of the system (long-term learning). In Müller et al. (2004), a market basket analysis algorithm is applied to image retrieval using long-term learning. A recent review of short-term and long-term learning relevance feedback techniques in CBIR can be found in Li and Allinson (2013). An extensive survey of relevance feedback in text-based retrieval systems is presented in Ruthven and Lalmas (2003), and for CBIR in Rui et al. (1997). Another survey (Crucianu et al. 2004) gives a good overview of key aspects of relevance feedback in image retrieval, such as the objectives of image retrieval, the main relevance feedback mechanisms and the different evaluation strategies, using real users or pseudo-relevance feedback.

Strategies for simulated feedback are also presented in Müller et al. (2000), where relevance feedback is divided into two main strategies: at the level of the results of multiple image queries, and at the feature level, creating a pseudo-image out of the set of image queries. The challenges of adding negative feedback to the mechanism are also discussed, as negative examples are often far more numerous than positive ones, thus “destroying” the query. In Qian et al. (2003), several alternating feature spaces are presented for relevance feedback and shown to improve results by exploring new areas of the feature space in each iteration. In Cox et al. (1996), a Bayesian approach to relevance feedback is proposed, following a target search algorithm. The same authors also explore relevance feedback techniques on small displays (Vinay et al. 2005).

In the medical informatics field, Chen et al. (2011) apply CBIR with relevance feedback to mammography retrieval. In Rahman et al. (2007), an image retrieval framework using relevance feedback, with support vector machines to compute the refined queries, is evaluated on a dataset of 5000 medical images.

There are many existing medical retrieval systems that combine text and visual information, such as NovaSearch (Mourão and Martins 2013), Open-I and others. Relevance feedback is not always available in these systems and is often only evaluated in a qualitative manner. In Rahman et al. (2011), an approach to relevance feedback using similarity fusion is shown to improve the retrieval performance over two iterations of medical image search. However, it is only evaluated with respect to the size of the shortlist used for the pseudo-relevance feedback and not against other relevance feedback techniques.

In this paper we evaluate different explicit, short-term relevance feedback techniques using visual content or text for medical image retrieval. We propose a technique that combines visual and text-based relevance feedback and show that it achieves performance competitive with state-of-the-art approaches.

2 Methods

In this study, the same categorization as in Müller et al. (2000) is followed. Two main feedback strategies are examined, distinguished by the retrieval stage at which the relevance feedback information is added to the new query. Figure 1 shows the image retrieval pipeline and the steps where relevance feedback can be applied. The first strategy operates at the feature level, resulting in a single feature representation for all the query images and consequently a single result list; no result list fusion is needed in this strategy. The second is performed at the result list fusion step, where each image query returns a different result list. In the multi-modal approaches, the combination of visual and textual information is obtained at the result list fusion stage for both strategies.

Fig. 1 Block diagram of the image retrieval pipeline. Relevance feedback strategies can be applied at the feature representation and result list fusion steps

2.1 Rocchio algorithm

One of the best-known relevance feedback techniques is Rocchio’s algorithm (Rocchio 1971). Its mathematical definition is given below:

$${\mathbf {q}}_m = \alpha {\mathbf {q}}_o + \beta \frac{1}{|D_r|} \sum _{{\mathbf {d}}_j\in D_r}{\mathbf {d}}_j - \gamma \frac{1}{|D_{nr}|} \sum _{{\mathbf {d}}_j\in D_{nr}}{\mathbf {d}}_j $$
(1)

where \({\mathbf {q}}_m\) is the modified query, \({\mathbf {q}}_o\) is the original query, \(D_r\) is the set of relevant images, \(D_{nr}\) is the set of non-relevant images and \(\alpha ,\beta \) and \(\gamma \) are weights.

Typical values for the weights are \(\alpha = 1\), \(\beta = 0.8\) and \(\gamma = 0.2\). Rocchio’s algorithm is typically used in vector space models and also in CBIR. Intuitively, the original query vector is moved towards the relevant vectors and away from the irrelevant ones. Weighting the positive and negative parts avoids a known problem of CBIR: when there is more negative than positive feedback, many relevant images can disappear from the result set (basically leaving images with few features, or with features not present in the initial set of images).
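To make Eq. (1) concrete, the following minimal Python sketch (illustrative only, not the implementation used in the experiments) applies the Rocchio update to feedback vectors; the names mirror the notation above, and the guard clauses simply skip empty feedback sets:

```python
import numpy as np

def rocchio(q_orig, relevant, non_relevant, alpha=1.0, beta=0.8, gamma=0.2):
    """Rocchio query modification (Eq. 1).

    q_orig       -- original query vector q_o
    relevant     -- list of relevant image/document vectors (D_r)
    non_relevant -- list of non-relevant vectors (D_nr)
    """
    q_mod = alpha * q_orig
    if relevant:                                   # move towards relevant centroid
        q_mod = q_mod + beta * np.mean(relevant, axis=0)
    if non_relevant:                               # move away from non-relevant centroid
        q_mod = q_mod - gamma * np.mean(non_relevant, axis=0)
    return q_mod
```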

2.2 Late fusion

Another technique that has shown potential in image retrieval (García Seco de Herrera et al. 2013) is late fusion. Late fusion (Depeursinge and Müller 2010) is used in information retrieval to combine result lists. It can be applied to fuse multiple queries and in multi-modal techniques, where, for example, the results of text and visual retrieval are combined. It can also be used to fuse multiple features, even though early fusion is more commonly chosen for this purpose. The concept behind this method is to merge the result lists into a single list while boosting common occurrences using a fusion rule.

For example, the fusion rule of the score-based late fusion method CombMNZ (Shaw and Fox 1994) is defined as:

$$S_{\texttt{combMNZ}}(i)=F(i)\cdot S_{\texttt{combSUM}}(i) $$
(2)

where F(i) is the number of retrieved lists in which image i is present with a non-zero score, and \(S_{\texttt{combSUM}}(i)\) is the CombSUM score of image i, given by

$$S_{\texttt{combSUM}}(i)=\sum _{j=1}^{N}{S_j(i)}$$
(3)

where \(S_j(i)\) is the score assigned to image i in retrieved list j and N is the number of retrieved lists being fused.
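For illustration, a short Python sketch of CombMNZ as defined in Eqs. (2) and (3); representing each result list as a dictionary from image id to score is an assumption made here for clarity, not the data structure of the actual system:

```python
from collections import defaultdict

def comb_mnz(result_lists):
    """Score-based CombMNZ fusion (Eqs. 2-3).

    result_lists -- list of dicts mapping image id -> retrieval score S_j(i)
    Returns (image id, fused score) pairs, best first.
    """
    comb_sum = defaultdict(float)  # S_combSUM(i): sum of scores over lists
    freq = defaultdict(int)        # F(i): lists containing i with non-zero score
    for scores in result_lists:
        for image_id, score in scores.items():
            comb_sum[image_id] += score
            if score != 0:
                freq[image_id] += 1
    fused = {i: freq[i] * comb_sum[i] for i in comb_sum}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Scores are usually normalized per list before fusion so that lists with different score ranges contribute comparably.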

2.3 Multi-modal relevance feedback

Most relevance feedback techniques use vectors from either the text or the visual models. However, it has been shown that approaches using both text and visual information can outperform single-modal ones in image retrieval if carried out carefully (Müller and Kalpathy-Cramer 2010). We propose the use of multi-modal information for relevance feedback to enhance retrieval performance. This is, to the best of our knowledge, the first time that such a technique has been proposed in image retrieval. As late fusion is applied on result lists, it is straightforward to use for combining results from visual and text queries.
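A minimal sketch of how such a multi-modal feedback step could be assembled from the comb_mnz sketch above. The per-list min-max normalization is an illustrative assumption; the text does not prescribe a specific normalization:

```python
def normalize(scores):
    """Min-max normalize one result list's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {i: 1.0 for i in scores}
    return {i: (s - lo) / (hi - lo) for i, s in scores.items()}

def multimodal_feedback(original_list, text_lists, visual_lists):
    """Fuse the original query's result list with the result lists obtained
    from the relevant images' captions (text) and visual features."""
    all_lists = [original_list] + text_lists + visual_lists
    return comb_mnz([normalize(l) for l in all_lists])
```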

2.4 Relevance feedback in multi-lingual queries

Another experiment run in this study investigates how RF performs in more realistic scenarios, where automatic spelling correction and language translation may have been applied to the query. For this, an even distribution of spelling errors across the text queries was introduced: diacritics omission, leaving out white space, character omission, character insertion, character replacement and character swapping. Automatic spelling correction was then applied to the queries, while queries in German, French and Czech were automatically translated into English.
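As an illustration, a Python sketch injecting the six error types listed above into a query; sampling a random edit position is an assumption here, as the exact injection procedure is not detailed:

```python
import random
import string
import unicodedata

def corrupt(query, kind):
    """Introduce one artificial spelling error of the given kind."""
    if kind == "diacritics":       # diacritics omission (e.g. é -> e)
        nfkd = unicodedata.normalize("NFKD", query)
        return "".join(c for c in nfkd if not unicodedata.combining(c))
    if kind == "whitespace":       # leaving out white space
        return query.replace(" ", "", 1)
    if len(query) < 2:
        return query
    pos = random.randrange(len(query))
    if kind == "omission":         # character omission
        return query[:pos] + query[pos + 1:]
    if kind == "insertion":        # character insertion
        return query[:pos] + random.choice(string.ascii_lowercase) + query[pos:]
    if kind == "replacement":      # character replacement
        return query[:pos] + random.choice(string.ascii_lowercase) + query[pos + 1:]
    if kind == "swap":             # character swapping
        pos = min(pos, len(query) - 2)
        return query[:pos] + query[pos + 1] + query[pos] + query[pos + 2:]
    raise ValueError(kind)
```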

2.5 Experimental setup

For evaluating the relevance feedback techniques, the following experimental setup was used: the n search iterations are initiated with a text query in iteration 0. The relevant results among the top k results of iteration i are used in the relevance feedback formula of iteration \(i+1\), for \(i=0,\ldots ,n-2\).
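A compact sketch of this simulated feedback loop; search, refine and is_relevant are illustrative placeholders, where refine could be Rocchio (Sect. 2.1) or late fusion (Sect. 2.2) and is_relevant is a ground-truth oracle standing in for the “perfect user” discussed later:

```python
def feedback_loop(initial_query, search, refine, is_relevant, n=5, k=100):
    """Simulated relevance feedback over n iterations with feedback depth k.

    search      -- function: query -> ranked result list
    refine      -- function: (query, relevant_results) -> modified query
    is_relevant -- function: result -> bool, from the ground truth
    """
    query = initial_query
    results = search(query)                  # iteration 0: text query
    for _ in range(n - 1):                   # iterations 1 .. n-1
        feedback = [r for r in results[:k] if is_relevant(r)]
        query = refine(query, feedback)
        results = search(query)
    return results
```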

The image dataset, topics and ground truth of the ImageCLEF 2012 medical image retrieval task (Müller et al. 2012) were used in this evaluation, specifically the ad-hoc image-based topics. The dataset contains more than 300,000 images from the medical open access literature (a subset of PubMed Central).

The image captions were used by the text-based runs and indexed with the Lucene text search engine. A vector space model was used along with tokenization, stopword removal, stemming and term frequency-inverse document frequency (tf-idf) weighting. The bag-of-visual-words model described in García Seco de Herrera et al. (2012) and the bag-of-colors model appearing in García Seco de Herrera et al. (2013) were used for the visual modelling of the images. Since only positive feedback was used in this study, no weights were used for the Rocchio algorithm. In the multi-modal runs, the fusion of the visual and text information is performed only for the top 1000 results, as the ImageCLEF evaluation only takes the top 1000 documents into account in any case.
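As a rough stand-in for the Lucene setup described above, the following sketch builds a tf-idf caption index with scikit-learn; it covers tokenization, stopword removal and tf-idf weighting but, unlike the actual configuration, omits stemming, and the example captions are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical captions standing in for the PubMed Central caption collection.
captions = ["chest x-ray showing pneumothorax", "CT scan of the abdomen"]
vectorizer = TfidfVectorizer(stop_words="english")  # tokenization + stopwords + tf-idf
index = vectorizer.fit_transform(captions)

def text_search(query, top=1000):
    """Rank captions by cosine similarity to the query in the vector space model."""
    q = vectorizer.transform([query])
    scores = cosine_similarity(q, index).ravel()
    return sorted(enumerate(scores), key=lambda kv: kv[1], reverse=True)[:top]
```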

Five techniques were evaluated in this study:

  1. text: text-based RF using the vector space model. Word stemming, tokenization and stopword removal are performed in both the text-based and multi-modal runs.

  2. visual_rocchio: visual RF using Rocchio to fuse the relevant image vectors and CombMNZ fusion to fuse the original query results with the visual results.

  3. visual_lf: visual RF using late fusion (with the CombMNZ fusion rule) to fuse the relevant image results and the original query results with the visual ones.

  4. mixed_rocchio: multi-modal RF using Rocchio to fuse the relevant image vectors and CombMNZ fusion to fuse the original query results with the relevant caption results and relevant visual results.

  5. mixed_lf: multi-modal RF using late fusion (with the CombMNZ fusion rule) to fuse the relevant image results and the original query results with the caption text results and relevant visual results.

Regarding the experiment combining relevance feedback with machine translation and automatic spelling correction, the Health on the Net (HON) spell checker was used. The number of spell-check suggestions to return was set to 25. The decision whether to take the most frequent spelling suggestion as the correctly spelt term (or keep the existing user query as being spelt correctly) was based on the ratio of spelling suggestion frequency to query term frequency in the collection (threshold set to \(\ge \)8:1 experimentally).
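A sketch of this decision rule with illustrative names; the suggestion list and frequency lookup stand in for the HON spell checker output and the collection statistics:

```python
def choose_spelling(query_term, suggestions, collection_freq, ratio=8.0):
    """Keep the query term or replace it with the most frequent suggestion.

    suggestions     -- candidate corrections from the spell checker (up to 25)
    collection_freq -- function: term -> frequency in the collection
    The term is replaced only if the suggestion's collection frequency exceeds
    the query term's frequency by the given ratio (>= 8:1 here).
    """
    if not suggestions:
        return query_term
    best = max(suggestions, key=collection_freq)
    q_freq = collection_freq(query_term)
    # A query term absent from the collection (frequency 0) is always corrected.
    if q_freq == 0 or collection_freq(best) / q_freq >= ratio:
        return best
    return query_term
```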

The MOSES system (Koehn et al. 2007) was used to automatically translate the German, French and Czech queries into English. The ImageCLEF dataset contains translations of the queries for German and French, while the Czech queries were translated manually.

Three main runs were evaluated for each language:

  • The first run used the queries after being translated into English by the query translation service.

  • The second run used the queries after being translated into English by the query translation service, with spelling errors artificially introduced.

  • The third run used the translated queries as in the two runs above, with the spelling errors corrected by the spelling correction service.

All of the above queries were used as input to the experiment described at the beginning of this section, using the best performing RF technique and \(k=100\).

3 Results

The evaluation of the five techniques was performed for \(k=5, 20, 50, 100\) and \(n = 5\). The mean average precision (mAP) of each technique per iteration is shown in Figs. 2, 3, 4 and 5.

Table 1 gives the best mAP score of each run. The numbers in parentheses denote the iteration in which this score was achieved. When the same score occurred in multiple iterations of the same run, the earliest iteration is given.

Fig. 2 Mean average precision per search iteration for \(k=5\)

Fig. 3 Mean average precision per search iteration for \(k=20\)

Fig. 4 Mean average precision per search iteration for \(k=50\)

Fig. 5 Mean average precision per search iteration for \(k=100\)

Table 1 Best mAP scores

Table 2 shows the effect of the translation (row 1), the introduction of spelling errors (row 2) and the automatic spelling correction (row 3) on the retrieval performance before applying any RF.

Table 2 Iteration 0 mAP for each of the languages

Figure 6 shows the mean average precision at iteration 4 of the mixed_lf technique.

Fig. 6 Plot of the mixed_lf values for iteration 4 for each of the languages. The values are for, respectively, no spelling error, spelling error introduced and spelling correction applied. Machine translation into English is applied for the French, Czech and German queries

Table 3 shows the results of a sample query for different categories of relevance feedback methods, illustrating the differences in the results when using text, visual and mixed relevance feedback.

Table 3 Sample query: top five results of iteration 4 \((k=100)\) for the three categories of methods (text, visual and mixed)

4 Discussion

Medical image retrieval from articles in the literature is a challenging task, as the image datasets from the biomedical literature are quite noisy (containing many graphs, diagrams and non-medical pictures). Moreover, the areas in the images containing pathology-related information are small and the differences between images are quite subtle, with usually only a very small portion of the image being relevant.

All of the evaluated techniques improve retrieval after the initial search iteration, which demonstrates the potential of relevance feedback for refining medical image search queries. Relevance feedback using only visual appearance models, even though it improves the retrieval performance after the first iteration, performed worse than the text-based runs in most cases. Visual features still suffer from the semantic gap between their expressiveness and human interpretation. Still, this shows their usefulness for image datasets where little or no text metadata is available. Moreover, when combined with the text information in the proposed method, they improve on the text-only baseline. Recently introduced higher-order representations, such as Fisher vectors or vectors of locally aggregated descriptors (VLADs), may further improve the retrieval results in this scenario.

The proposed multi-modal runs provide the best results in all cases except \(k=5\). Surprisingly, the visual runs perform slightly better than the text and multi-modal approaches in this case. However, assuming independent and normally distributed average precision values, the significance tests show that the difference is not statistically significant.

We consider the case \(k = 20\) the most realistic scenario, since users do not often inspect more than two pages of results and most actually stay on the initial results page only. Especially for grid-like result interface views, where each page can contain 20–50 results, we consider \(k=20\) more realistic than \(k=5\). In this case, the proposed methods achieve the best performance, with mAP scores of 0.2606 and 0.2635 respectively. Again, the significance tests do not find any significant difference between the three best approaches. However, applying different fusion rules for combining visual and text information (such as linear weighting) could further improve the results of the mixed approaches. It can be noted that as k increases, the performance improvement also increases, highlighting the added value of relevance feedback. Values of k larger than 100 were not explored, as such scenarios were judged to be unrealistic.

In the visual runs, using Rocchio to combine the visual queries performs worse than late fusion, in accordance with the findings in García Seco de Herrera et al. (2012). The reason could be that the large visual diversity of relevant images in medicine, together with the curse of dimensionality, causes the modified vector to behave as an outlier in the high-dimensional visual feature space. In the mixed runs, the difference between the two methods is not statistically significant, with Rocchio performing slightly better than late fusion.

Relevance feedback is shown to be able to improve retrieval performance in difficult real-world scenarios where spelling errors are introduced and corrected, and machine translation is applied to the queries.

Irrelevant results were ignored, as they often have little or no impact on the retrieval performance (Müller et al. 2000; Salton and Buckley 1997). More importantly, the ground truth of the dataset used contains a much larger portion of annotated irrelevant results than relevant ones; using all of them was considered to simulate a potentially unrealistic scenario, as users do not usually mark many results as negative examples (or only very few). Having too many negative examples could also cause the modified vector to behave as an outlier. Preliminary results confirmed this hypothesis: the use of negative results for relevance feedback can decrease performance after the first iteration if not handled carefully.

A larger number of steps could be investigated, but this might be unrealistic, given that physicians have little time and stop after a few minutes of search (Markonis et al. 2012). Users will often test only a few steps of relevance feedback at most.

In the evaluation of our relevance feedback mechanism, we assume a perfect user who marks all relevant items as relevant. As past literature shows (Müller et al. 2000), by selecting not all relevant items but those adding the most information, a human user familiar with the system can potentially achieve better results than the automatic system when adding positive and negative feedback. For novice users this is unlikely: a novice user might simply mark all relevant items, which is the scenario we adopt for our feedback evaluation, so we consider it a good approximation.

Perfect positive feedback also has the advantage that results are reproducible, so exactly the same conditions can be recreated for other techniques, whereas manual user tests can depend strongly on the person supplying the feedback and thus do not allow comparing performance across several systems in a homogeneous way. User tests with a very similar system and several users have been published (Markonis et al. 2013), and for this article we feel a reproducible way of supplying relevance feedback is preferable. This does not replace user tests, and more about interaction and interface design can be learned through such tests.

5 Conclusions

This paper proposes the use of multi-modal information when applying relevance feedback to medical image retrieval. An experiment was set up to simulate the relevance feedback of a user on medical topics from ImageCLEF 2012.

In general, all the techniques evaluated in this study improve performance, which shows the added value of relevance feedback. Text-based relevance feedback showed consistently good results. Visual techniques showed competitive performance for a small number of steps, underperforming in the remaining cases. The proposed multi-modal approaches show promising results, slightly outperforming the text-based approach but without statistical significance.

More fusion techniques will be evaluated in the future. A comparison with manual query refinement by users is also planned, to assess relevance feedback as a concept in medical image retrieval. The addition of semantic search is of interest as well, in order to take advantage of the structured knowledge of medical ontologies such as RadLex (Radiology Lexicon) (Langlotz 2006) and MeSH (Medical Subject Headings) (WHSL Medical Subject Headings for PubMed Searching 2014) to model semantic knowledge.