
1 Introduction

Image retrieval and image classification have been extremely active research domains, with hundreds of publications over the past 20 years [1–3]. Content-based image retrieval has been proposed for diagnosis aid, decision support and enabling similarity-based easy access to medical information [4, 5], ranging from similar cases to similar images and similar regions of interest.

One of the main domains of image retrieval has been the medical literature, with millions of images being available [6, 7]. ImageCLEFmed (the medical image retrieval task of the Cross Language Evaluation Forum) is an annual evaluation campaign on the retrieval of images from the biomedical open access literature [8]. In the ImageCLEF medical task, usually 12–17 teams have compared their approaches each year from 2004 to 2013, based on a variety of search tasks [9].

The Bag-of-Visual-Words (BoVW) is a visual description technique that aims at narrowing the semantic gap by partitioning a low-level feature space into regions that potentially correspond to visual topics. These regions are called visual words, in analogy to text-based retrieval and the bag-of-words approach. An image can be described by assigning a visual word to each of the feature vectors that describe local regions or patches of the image (sampled either on a dense grid or at interest points, often based on saliency), and then representing the set of feature vectors by a histogram of the visual words. One of the most interesting characteristics of the BoVW approach is that the set of visual words is created from the actual data, and therefore only topics present in the data are part of the visual vocabulary [10].

The creation of the vocabulary is normally based on a clustering method (e.g. k-means, DENCLUE) that identifies local clusters in the feature space, a visual word then being assigned to each cluster center. This has been investigated previously, either by searching for the optimal number of visual words [11], by using clustering algorithms other than k-means [12], or by selecting interest points to obtain the features [13].
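As a toy illustration of the two steps described above, the following sketch clusters a pool of local descriptors into a vocabulary with k-means and then encodes an image as a normalized histogram of visual words (scikit-learn and the vocabulary size of 500 are illustrative choices, not necessarily those used in our experiments):

    # Sketch of visual vocabulary creation and BoVW encoding; illustrative only.
    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(descriptors, n_words=500, seed=0):
        """Cluster pooled local descriptors; the cluster centers are the visual words."""
        return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(descriptors)

    def bovw_histogram(vocabulary, image_descriptors):
        """Assign each local descriptor to its nearest visual word and count."""
        words = vocabulary.predict(image_descriptors)
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / hist.sum()  # normalized histogram over the vocabulary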

Although the BoVW is widely used in the literature [14, 15], there is a strong performance variation within similar experiments when different vocabulary sizes are considered [11], making the choice of vocabulary size a crucial aspect of visual vocabularies. We hypothesize that this variance of the BoVW method is strongly related to the quality of the vocabulary used, understanding quality as the ability of the vocabulary to accurately describe concepts useful for the task. Therefore, we try to reduce the size of the vocabulary without reducing the performance of the method. The use of supervised clustering [16, 17] to force the clusters to a known number of classes was also considered as an option, but it runs against the notion of learning the variety of topics present in the data. Instead, we compute the latent semantic topics in the dataset in an unsupervised way by analyzing the probability of each word to occur. This makes it possible to extract concepts or topics from a combination of various visual word types, since the topics are discovered based on the probability of co-occurrence of a set of visual words regardless of their origin. The resulting reduced vocabularies present two benefits over the full ones. First, shorter descriptors reduce the computational cost of the online phase of retrieval as well as of the offline indexing phase. This reduction becomes important in the context of large-scale databases or big data. Second, by removing non-meaningful visual words, the dataset description becomes more compact. A compact representation makes it easier to use neighbourhood-based classifiers, which tend to fail in high-dimensional feature spaces due to the curse of dimensionality. Finally, a transformation of the descriptor is proposed that combines the pruning of meaningless visual words with weighting meaningful words according to their importance for the visual topics.

The rest of this chapter is organized as follows: Sect. 2 explains in detail the materials and methods used, with a focus on the data set, probabilistic latent semantic analysis and how it is used to remove meaningless visual words from the vocabulary. Section 3 presents the results of the experiments run on the data set, while Sect. 4 discusses them. Conclusions and future work are given in Sect. 5.

2 Materials and Methods

In this section, further details on the data set and the techniques employed are given.

2.1 Data Set

Image modality filters are one of the features of medical image retrieval that practitioners would like to see included in existing image search systems [18]. Medical image search engines such as GoldMiner and Yottalook contain modality filters to allow users to focus retrieval results. Whereas DICOM headers often contain metadata that can be used to filter modalities, this information is lost when images are exported for publication in journals or conferences, where they are stored as JPG, GIF or PNG files and are usually further processed (fewer grey levels, cropping, etc.). In this case, the visual appearance is key to identifying modalities, or the caption text can be analyzed for respective keywords. The ImageCLEFmed evaluation campaign contains a modality classification task that is regarded as an essential part of image retrieval systems. In 2012, the modality classification data set contained 2,000 images from the medical literature organized into a hierarchy of 31 categories [19]. Figure 1 shows the hierarchical structure of modalities. Each image in the dataset belongs to a single leaf node of the hierarchy.

Fig. 1 Hierarchy of modalities or image types considered in the modality classification task

The modality classification dataset is divided into two subsets of 1,000 images each, one for training and one for testing. The training set and its corresponding ground truth are made public for the groups to train and optimize their methods but the comparison is performed on a test set of which the ground truth is not known by the groups. Figure 2 shows the distribution of images across modalities in the training and test sets.

Fig. 2 Distribution of images across modalities for the modality classification training and test sets

Besides modality classification, an image retrieval task is also performed during the benchmarking event, where independent assessors judge the relevance of each document in the pool of results submitted by the groups. The retrieval task is performed on the full ImageCLEFmed data set, which in 2012 consisted of more than 306,000 images from the open access biomedical literature.

Both data sets were used in the experiments described in this chapter. Methods were first tested on the modality classification data set (training and testing) to investigate the effect of the parameters on the system. Then, fewer parameter combinations were tested on the retrieval task with the larger database.

2.2 Descriptors

In this section, the descriptors used in our experimental evaluation are presented. The Scale Invariant Feature Transform (SIFT) and the Bag-of-Colors (BoC) were chosen as image descriptors.

2.2.1 SIFT

In this work, images are described with a BoVW based on their SIFT [20] descriptors. This representation has been commonly used for image retrieval because it can be computed efficiently [15, 21, 22]. The SIFT descriptor is invariant to translations, rotations and scaling transformations, and robust to moderate perspective transformations and illumination variations. SIFT encodes the salient aspects of the grey-level image gradient in a local neighbourhood around each interest point.
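For illustration, such descriptors can be extracted, for example, with OpenCV; the chapter does not prescribe a particular implementation, so the following is a minimal sketch only:

    # Extract SIFT descriptors from a grey-level image with OpenCV (illustrative).
    import cv2

    def sift_descriptors(image_path):
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # SIFT uses grey levels
        sift = cv2.SIFT_create()
        _keypoints, descriptors = sift.detectAndCompute(img, None)
        return descriptors  # one 128-dimensional vector per interest point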

2.2.2 Bag of Colors

BoC is used to extract a color signature from the images [23]. The method is based on the BoVW image representation, which facilitates the fusion with the SIFT-BoVW descriptor. The CIELab color space was used since it is a perceptually uniform color space [24]. A color vocabulary \(\mathcal{C} =\{ c_{1},\ldots,c_{100}\}\), with \(c_{i} = (L_{i},a_{i},b_{i}) \in CIELab\), is defined by automatically clustering the most frequently occurring colors in the images of a subset of the collection containing an equal number of images from the various classes.

The BoC of an image I is defined as a vector \(BoC =\{\bar{ c}_{1},\ldots,\bar{c}_{100}\}\) such that, for each pixel \(p_{k} \in I\):

$$\displaystyle{\bar{c}_{i} =\sum _{k=1}^{P}g_{ i}(p_{k})}$$

with P the number of pixels in the image I, where

$$\displaystyle{ g_{i}(p) = \left \{\begin{array}{l} 1\ \mathrm{if}\ d(p,c_{i}) \leq d(p,c_{l})\ \forall l \in \{ 1,\ldots,100\}\\ 0\ \mathrm{otherwise}\\ \end{array} \right. }$$
(1)

and d(x, y) is the Euclidean distance between x and y.
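A direct translation of this definition could look as follows (a sketch: scikit-image is our choice for the CIELab conversion, and the 100-word color vocabulary is assumed to be precomputed as described above):

    # Bag-of-Colors: every pixel votes for its nearest CIELab color word.
    import numpy as np
    from skimage import color, io

    def bag_of_colors(image_path, color_words):
        """color_words: (100, 3) array of (L, a, b) color words c_1..c_100."""
        lab = color.rgb2lab(io.imread(image_path))   # H x W x 3 pixels in CIELab
        pixels = lab.reshape(-1, 3)                  # one row per pixel p_k
        # Euclidean distances d(p, c_j) from every pixel to every color word
        dists = np.linalg.norm(pixels[:, None, :] - color_words[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)               # g_j(p_k) = 1 only for the nearest c_j
        return np.bincount(nearest, minlength=len(color_words)).astype(float)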

2.3 Vocabulary Pruning and Descriptor Transformation Using Probabilistic Latent Semantic Analysis

In spoken or written language, not all words contain the same amount of information. Specifically, the grammatical class of a word is tightly linked to the amount of meaning it conveys. E.g. nouns and adjectives (open grammatical classes) can be considered more informative than prepositions and pronouns (closed grammatical classes).

Similarly, in a vocabulary of \(N_{W}\) visual words generated by clustering a feature space populated with training data, not all words are useful to describe the appearance of the visual instances.

From an information theoretical point of view, a bag of (visual) words containing \(L_{i}\) elements can be seen as \(L_{i}\) observations of a random variable W. The unpredictability or information content of the observation corresponding to the visual word \(w_{n}\) is

$$\displaystyle{ I(w_{n}) = log\left ( \frac{1} {P(W = w_{n})}\right ) }$$
(2)

This explains why nouns or adjectives contain, in general, more information than prepositions or pronouns. Words belonging to a closed class are more probable than those belonging to a much richer class. According to Eq. 2, information is related to unlikelihood of a word.
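A two-line numeric illustration of Eq. 2 (the occurrence counts are invented for the example):

    # Information content of words from (made-up) occurrence counts, Eq. (2).
    import numpy as np

    counts = np.array([5000.0, 120.0, 3.0])   # a frequent, a medium and a rare word
    p = counts / counts.sum()                 # P(W = w_n)
    information = np.log(1.0 / p)             # I(w_n); the rarest word scores highest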

In a bag of visual words scheme for visual understanding, it is important to use very specific words with high discriminative power. On the other hand, very specific words alone do not always make it possible to establish and recognize similarities. This can be done by establishing a concept that generalizes very specific words with similar meanings into a less specific visual topic. E.g., in order to recognize the similarity between the (specific) words bird and fish, we need a less specific topic such as animal.

A visual topic z is the representation of a generalized version of the visual appearance modeled by various visual words. It corresponds to an intermediate level between visual words and the complete understanding of visual information. A set of visual topics \(\mathcal{Z} = \left \{z_{1},\ldots,z_{N_{Z}}\right \}\) can be defined in a way that every visual word can belong to none, one or several visual topics, therefore establishing and possibly quantifying the relationships among words (see Fig. 3).

Fig. 3 Conceptual model of visual topics, words and features. Whereas continuous features are the most informative descriptors from an information theoretical point of view, visual words generalize feature points that are close in the feature space. We propose visual topics as a higher generalization level, modelling partially shared meanings among words

2.3.1 Probabilistic Latent Semantic Analysis

Visual words are often referred to as an extension of the bag-of-words technique used in information retrieval from textual to visual data. Similarly, language modelling techniques have also been extended from text to visual word-based techniques [25, 26].

Latent Semantic Analysis (LSA) [27] is a language modelling technique that maps documents to a vector space of reduced dimensionality, the latent semantic space, based on a Singular Value Decomposition (SVD) of the term-document co-occurrence matrix. The technique was later extended to a statistical model, called Probabilistic Latent Semantic Analysis (PLSA), by Hofmann [28]. PLSA removes restrictions of the earlier, purely algebraic approach (namely, the linearity of the mapping).

Hofmann defines a generative model in which the observed probability of a word or term \(w_{j},j \in 1,\ldots,M\), occurring in a given document \(d_{i},i \in 1,\ldots,N\), is linked to a latent or unobserved set of concepts or topics \(\mathcal{Z} =\{ z_{1},\ldots,z_{K}\}\) that occur in the text:

$$\displaystyle{ P(w_{j}\vert d_{i}) =\sum _{ k=1}^{K}P(w_{ j}\vert z_{k})P(z_{k}\vert d_{i}). }$$
(3)

The model is fit via the EM (Expectation-Maximization) algorithm. For the expectation step:

$$\displaystyle{ P(z_{k}\vert d_{i},w_{j}) = \frac{P(w_{j}\vert z_{k})P(z_{k}\vert d_{i})} {\sum _{l=1}^{K}P(w_{j}\vert z_{l})P(z_{l}\vert d_{i})}. }$$
(4)

and for the maximization step:

$$\displaystyle\begin{array}{rcl} P(w_{j}\vert z_{k}) = \frac{\sum _{i=1}^{N}n(d_{i},w_{j})P(z_{k}\vert d_{i},w_{j})} {\sum _{m=1}^{M}\sum _{i=1}^{N}n(d_{i},w_{m})P(z_{k}\vert d_{i},w_{m})},& &{}\end{array}$$
(5)
$$\displaystyle\begin{array}{rcl} P(z_{k}\vert d_{i}) = \frac{\sum _{j=1}^{M}n(d_{i},w_{j})P(z_{k}\vert d_{i},w_{j})} {n(d_{i})}.& &{}\end{array}$$
(6)

where \(n(d_{i},w_{j})\) denotes the number of times the term \(w_{j}\) occurred in document \(d_{i}\), and \(n(d_{i}) =\sum _{j}n(d_{i},w_{j})\) is the document length.

These steps are repeated until convergence or until a termination condition is met. As a result, two probability matrices are obtained: the word-topic probability matrix \(W_{M\times K} = (P(w_{j}\vert z_{k}))_{j,k}\) and the topic-document probability matrix \(D_{K\times N} = (P(z_{k}\vert d_{i}))_{k,i}\).
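A compact numpy sketch of this EM procedure, assuming a word-document count matrix as input, is shown below; it is illustrative only (a practical implementation would add a convergence criterion and avoid materializing the full N × M × K array for large collections):

    # PLSA fitted via EM (Eqs. 3-6) on a count matrix n_dw of shape (N, M).
    import numpy as np

    def plsa(n_dw, K, n_iter=100, seed=0):
        """Return the word-topic matrix P(w|z), (M, K), and topic-document matrix P(z|d), (K, N)."""
        rng = np.random.default_rng(seed)
        N, M = n_dw.shape
        p_w_z = rng.random((M, K)); p_w_z /= p_w_z.sum(axis=0)  # P(w_j | z_k)
        p_z_d = rng.random((K, N)); p_z_d /= p_z_d.sum(axis=0)  # P(z_k | d_i)
        for _ in range(n_iter):
            # E-step, Eq. (4): P(z_k | d_i, w_j), shape (N, M, K)
            joint = p_w_z[None, :, :] * p_z_d.T[:, None, :]
            p_z_dw = joint / joint.sum(axis=2, keepdims=True)
            # M-step, Eqs. (5) and (6)
            weighted = n_dw[:, :, None] * p_z_dw            # n(d_i, w_j) P(z_k | d_i, w_j)
            p_w_z = weighted.sum(axis=0)
            p_w_z /= p_w_z.sum(axis=0, keepdims=True)
            p_z_d = weighted.sum(axis=1).T
            p_z_d /= n_dw.sum(axis=1)[None, :]              # divide by n(d_i)
        return p_w_z, p_z_d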

2.3.2 PLSA for Visual Words

The PLSA technique only requires a word-document co-occurrence matrix and can therefore be called feature-agnostic. Since it sets no requirements on the nature of the low-level features that yield these co-occurrence matrices (other than being discrete), the extension to visual words is straightforward. PLSA in combination with visual words has also been applied for classification purposes in [29, 30].

In our approach, images are described in terms of a BoC in the CIELab color space and a BoVW based on SIFT descriptors. Therefore, the dataset can be described using the following co-occurrence matrices:

$$\displaystyle\begin{array}{rcl} C_{N\times N_{C}} = (n(d_{i},c_{j}))_{i,j},& &{}\end{array}$$
(7)
$$\displaystyle\begin{array}{rcl} S_{N\times N_{S}} = (n(d_{i},s_{l}))_{i,l},& &{}\end{array}$$
(8)

where N is the number of images in the dataset, \(N_{C}\) the length of the color vocabulary, \(N_{S}\) the length of the SIFT-based vocabulary, and \(n(d_{i},c_{j})\) (respectively \(n(d_{i},s_{l})\)) the number of occurrences of the color word \(c_{j}\) (respectively the SIFT word \(s_{l}\)) in the image \(d_{i}\).

2.3.3 Vocabulary Pruning

The key idea of our approach is that the color and SIFT vocabularies are not only individually over-complete and redundant for the dataset, but may also contain visual words that model the same latent topics. Therefore, a full color-SIFT representation of the dataset is obtained by concatenating the two matrices C and S into a single \(N \times (N_{C} + N_{S})\) visual feature matrix V.

The matrix V is then analysed using the PLSA technique with a varying number of topics K and the resulting visual word-topic conditional probability matrices \(W_{(N_{C}+N_{S})\times K}\) are used to find the meaningless visual words that need to be removed from the vocabulary.

A visual word is considered meaningless if its conditional probability is below the significance threshold \(T_{k}\) for every latent topic. Since each topic can be linked to a different number of visual words, the significance threshold is not an absolute value but relative to each topic. In our approach, \(T_{k}\) takes the value of the \(p_{T}\)-th percentile of each topic. This keeps only the \((100 - p_{T})\)% most significant visual words for each topic, while the remaining visual words are removed. A visual word can signify several topics (polysemic words) and several visual words can be equally significant for a given topic (synonyms). Because of these factors, which are common in language modelling, the vocabulary reduction cannot be estimated directly from the value of \(p_{T}\): it depends on the distribution of synonyms and polysemic words in the experimental data.
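The pruning step then reduces to a per-topic percentile test on the word-topic matrix; a sketch, assuming W is the \((N_{C}+N_{S})\times K\) matrix of \(P(w\vert z)\) values returned by PLSA on the concatenated matrix V:

    # Keep a visual word iff its P(w|z) reaches the p_T-th percentile of some topic.
    import numpy as np

    def prune_vocabulary(W, p_T):
        """W: (n_words, K) word-topic probabilities. Returns a keep-mask over words."""
        T = np.percentile(W, p_T, axis=0)        # one threshold T_k per topic
        return (W >= T[None, :]).any(axis=1)     # meaningful for at least one topic

    # Example usage: V = np.hstack([C, S]); W, _ = plsa(V, K=10)
    # mask = prune_vocabulary(W, p_T=80); V_reduced = V[:, mask]
    # (histograms are then recomputed/renormalized on the reduced vocabulary)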

The number of latent topics and the value of the significance percentile are parameters of the presented technique. Section 3 reports the experimental evaluation of the technique for various values of K and \(p_{T}\).

2.3.4 Meaningfulness-Based Descriptor Transformation

Instead of using a hard decision based on a meaningfulness threshold, a transformation can be defined that weights visual words according to their meaningfulness. The meaningfulness of a visual word \(w_{n}\) is its maximum topic-based significance level, with \(t_{n,j} = P(w_{n}\vert z_{j})\):

$$\displaystyle{m_{n} = \left \{\begin{array}{l} \max _{j}\left \{t_{n,j}\right \}\ \mathrm{if}\ \max _{j}\left \{t_{n,j}\right \} \geq T_{meaning} \\ 0\ \mathrm{otherwise}\\ \end{array} \right.}$$

Let h be a histogram vector where each component represents the multiplicity of a visual word, and M a meaningfulness transformation matrix:

$$\displaystyle\begin{array}{rcl} \mathbf{h}& =& (n(w_{1}),n(w_{2}),\ldots,n(w_{N_{W}}))^{T}{}\end{array}$$
(9)
$$\displaystyle\begin{array}{rcl} \mathbf{M}& =& \left (\begin{array}{cccc} m_{1} & 0 &\cdots & 0 \\ 0 &m_{2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 &\cdots &m_{N_{W}} \end{array} \right ){}\end{array}$$
(10)

Then, the vector \(\mathbf{h^{M}} = (n(w_{1}^{M}),n(w_{2}^{M}),\ldots,n(w_{N_{W}}^{M}))^{T}\) is the histogram vector of visual words in the meaningfulness-transformed space.

$$\displaystyle\begin{array}{rcl} \mathbf{h^{M}}& =& \mathbf{M}\mathbf{h}{}\end{array}$$
(11)
$$\displaystyle\begin{array}{rcl} n(w_{i}^{M})& =& m_{ i} \cdot n(w_{i}){}\end{array}$$
(12)
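Since M is diagonal, the transformation amounts to an element-wise re-weighting of the histogram; a short sketch, assuming W is the word-topic probability matrix from Sect. 2.3.2:

    # Meaningfulness transform, Eqs. (9)-(12): weight words by max topic probability.
    import numpy as np

    def meaningfulness_weights(W, T_meaning):
        m = W.max(axis=1)              # m_n = max_j t_{n,j}
        m[m < T_meaning] = 0.0         # words below the threshold are removed
        return m

    def transform_histogram(h, m):
        return m * h                   # h^M = M h, with M = diag(m_1, ..., m_{N_W})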

2.4 Experiments

Several experiments were run to evaluate the performance of the vocabulary pruning technique. In this section, the experiments are described.

2.4.1 Classification with a Truncated Descriptor

Preliminary experiments with the vocabulary pruning technique on the training set were based on removing meaningless visual words from the descriptors but not from the vocabulary (i.e., the histogram values of the remaining meaningful visual words stay the same and the histograms are therefore no longer normalized).

By running a twofold cross-validation on the modality classification training set, the effect of the parameters K (number of latent topics) and \(p_{T}\) (significant percentile threshold) was investigated. All descriptors were computed using the full vocabulary, and visual words below the significance threshold were later removed from the descriptors. No fusion rules were applied to the SIFT-BoVW and BoC descriptors.

2.4.2 Classification with a Reduced Vocabulary

In this experiment, meaningless visual words were removed from the vocabulary and the histograms were recomputed, therefore staying normalized. Due to the presence of very unbalanced classes in the dataset, the experiments included both a twofold cross-validation on the training set and an evaluation with separate training and test sets. The same experiments were run with the full vocabularies.

Classification using the SIFT-BoVW and BoC descriptors can benefit from a fusion technique that combines texture and color information. The similarity scores were calculated with the two descriptors separately, and the CombMNZ fusion rule [31] was used to obtain the final scores. Images were classified using weighted k-NN (k-nearest neighbors) voting [32]. Experiments were run with various values of k for the voting.
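A sketch of these two steps is given below; it reflects our reading of CombMNZ (min-max normalized scores summed and multiplied by the number of descriptors retrieving the image), and all names are illustrative:

    # CombMNZ fusion of per-descriptor similarity scores and weighted k-NN voting.
    def comb_mnz(score_lists):
        """score_lists: one {image_id: similarity} dict per descriptor (SIFT-BoVW, BoC)."""
        fused = {}
        for scores in score_lists:
            lo, hi = min(scores.values()), max(scores.values())
            for img, s in scores.items():
                norm = (s - lo) / (hi - lo) if hi > lo else 0.0
                total, hits = fused.get(img, (0.0, 0))
                fused[img] = (total + norm, hits + 1)
        return {img: total * hits for img, (total, hits) in fused.items()}

    def weighted_knn_label(neighbors, k):
        """neighbors: (label, fused_score) pairs sorted by decreasing score."""
        votes = {}
        for label, score in neighbors[:k]:
            votes[label] = votes.get(label, 0.0) + score  # score-weighted vote
        return max(votes, key=votes.get)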

2.4.3 Retrieval with a Reduced Vocabulary Over the Complete Data Set

In this experiment, the complete ImageCLEF dataset for medical images was indexed for retrieval. The number of images in the dataset (306,000) is sufficiently large to allow measuring the speed gain obtained when reducing the vocabulary. Retrieval was performed using the fusion rule described in Sect. 2.4.2. The retrieval experiment consisted of the 22 topics of the ImageCLEF 2012 medical track, each consisting of 1–7 query images.

2.4.4 Classification Using Descriptor Transformation

In order to assess the impact of the vocabulary size and the meaningfulness-based weighting of visual words, an experimental evaluation based on the SIFT description of the images (see Sect. 2.2.1) was performed.

Evaluation with separate training and test sets was performed using all combinations of the following parameters:

1. Two SIFT-based visual vocabularies with 100 and 500 visual words.

2. A varying number of visual topics, from 25 to 350 in steps of 25.

3. A varying meaningfulness threshold, from 50 to 100 %.

3 Results

In this section a summary of the results for each experiment is given.

3.1 Truncated Descriptor

This section presents the results of the experiment described in Sect. 2.4.1. Since the descriptor requires the full vocabulary before the truncation of meaningless words is performed, no speed gain was obtained in the offline phase.

Figure 4a shows the accuracy obtained using a 1-NN classifier; Fig. 4b shows the corresponding effect of truncating the descriptors on the vocabulary size. The number of latent topics K varies from 10 to 100 in steps of 10, and the significant percentile threshold for each topic, \(p_{T}\), from 1 to 99.

Fig. 4 Evaluation of descriptor truncation over the modality classification training set using cross-validation. 1-NN classification was performed for a varying number of latent topics K and significant percentile \(p_{T}\). (a) Effect on classification accuracy. (b) Effect on effective vocabulary size

The effect of increasing the significance percentile is much stronger on the number of visual words used than on the classification accuracy. Similarly, the number of latent topics has a limited impact on accuracy while having a strong impact on the vocabulary size. Rather unsurprisingly, the fewer latent topics considered, the easier it becomes to find meaningless visual words. Also, vocabulary sizes tend to be more similar across K values when \(p_{T}\) is high.

Statistical significance tests were run to compare the result distributions obtained with the truncated descriptors. These tests failed to show a statistically significant difference between classification using the full descriptor and classification using any of the reduced descriptors over the training set.

3.2 Reduced Vocabulary Over Modality Classification Training and Test Sets

This section contains a summary of the results of the experiments described in Sect. 2.4.2.

Table 1 contains a summary of the best results for a significant percentile \(p_{T} = 80\) and a varying number of topics. It also includes the results obtained with the full vocabulary using the same classifier. Although it is not shown in the table, all of the words removed for \(p_{T} = 80\) belonged to the SIFT-BoVW vocabulary.

Table 1 Best classification results (varying the k-NN voting) over the training set for a varying number of latent topics and a fixed significant percentile \(p_{T} = 80\)

Table 2 contains the corresponding results for the 99th percentile as significance threshold. In this experiment, meaningless words were found in both the BoC and the SIFT-BoVW vocabularies.

Table 2 Best classification results (varying the k-NN voting) over the training set for a varying number of latent topics and a fixed significant percentile \(p_{T} = 99\)

Tables 3 and 4 contain the corresponding results over the test set when performing cross-validation with separate test and training sets. The vocabularies used are the same as those from Tables 1 and 2.

Table 3 Best classification results (varying the k-NN voting) over the test set for a varying number of latent topics and a fixed significant percentile \(p_{T} = 80\)
Table 4 Best classification results (varying the k-NN voting) over the test set for a varying number of latent topics and a fixed significant percentile \(p_{T} = 99\)

3.3 Reduced Vocabulary for the Retrieval Task

Based on the results in Sect. 3.2, two vocabularies were selected for the ImageCLEFmed retrieval task: the smallest vocabulary, obtained with \(p_{T} = 99\) and 10 latent topics, and the most accurate vocabulary, obtained with \(p_{T} = 80\) and 10 latent topics.

Table 5 contains a summary of the time required for indexing the complete dataset with the most accurate configuration (\(p_{T} = 80\), 10 latent topics), the smallest vocabulary (\(p_{T} = 99\), 10 latent topics) and the complete vocabulary.

Table 5 Average indexing time per image for the smallest vocabulary, the most accurate vocabulary and the complete vocabulary

Table 6 shows the results when performing the retrieval task on the complete ImageCLEFmed 2012 dataset with the selected vocabularies for each of the 22 topics or queries.

Table 6 Results of retrieval experiments for each vocabulary

(b) Mean Average Precision (MAP) across all topics

  Vocabulary used             MAP (%)
  Complete vocabulary         6.51
  \(p_{T} = 80\), K = 10      6.52
  \(p_{T} = 99\), K = 10      1.51

(c) Average execution times of the online phase for a single query image

  Vocabulary used             Online retrieval time
  Complete vocabulary         125 s
  \(p_{T} = 80\), K = 10      107 s
  \(p_{T} = 99\), K = 10      45 s

3.4 Descriptor Transformation and Effect on Vocabulary Size

Using the parameters explained in Sect. 2.4.4 and applying the transformation proposed in Sect. 2.3.4, the effect of the initial vocabulary size and the meaningfulness threshold can be studied.

Figure 5 shows the effect of the transformation when using various meaningfulness thresholds on two vocabularies.

Fig. 5 Evaluation of descriptor transformation using the proposed meaningfulness transform over the modality classification task with separate training and test sets. 1-NN classification was performed for a varying number of latent topics and meaningfulness thresholds

4 Discussion

As shown in Fig. 4, PLSA-based pruning has a stronger effect on the size of the vocabulary than on the performance of the classifiers. Table 2 shows that a vocabulary reduction of up to 91.72 % can be obtained with comparable accuracy for the same classifier. For the 99th percentile, the best classification method with the reduced vocabulary always obtains a higher accuracy than the same classification method on the full vocabulary.

However, significance tests failed to show a statistically significant difference between the various accuracies obtained. Therefore, the main contribution of this work is a method that can enormously reduce visual word vocabularies while obtaining a comparable (and often slightly higher) accuracy.

Another important aspect of the results is that the PLSA-based pruning finds the BoC vocabulary to be more meaningful than the SIFT-BoVW one. Whereas in the complete vocabulary the SIFT-based words outnumbered the color words by a factor of 2.38, this relationship is inverted in the smallest vocabulary, where there are more than two color words for each SIFT-based word.

Results in Table 5 show that the reduction of the indexing time is smaller than the reduction in the number of words. Nevertheless, the smallest vocabulary yields an indexing time 55.9 % lower than that of the complete vocabulary. Studies have shown that reducing the number of features used as a descriptor can increase the speed of online retrieval [33]. This is confirmed in Table 6c, with retrieval times up to 64 % lower when using the smallest vocabulary.

Results in Tables 1–4 show that the performance is much better for the modality classification task than for retrieval in the complete ImageCLEFmed dataset (see Table 6), probably due to the size of the training set (1,000 images) in comparison with the 306,000 images in the complete dataset. For the retrieval task, the vocabularies show a comparable performance in terms of recall, with the \(p_{T} = 80\), K = 10 vocabulary being slightly better than the others. However, the mean average precision varies strongly between the large vocabularies and the smallest vocabulary (\(p_{T} = 99\), K = 10).

Evaluation of the proposed meaningfulness transformation shows an improvement in accuracy, together with the impact on vocabulary size already found for the PLSA-based pruning. The increase in accuracy is non-negligible and passes statistical significance tests. The accuracy is increased for both original vocabularies tested, and there is a slight saturation effect whereby the size of the descriptor can be reduced without impact on accuracy. Massive reductions of the descriptor, however, strongly reduce performance.

It can be argued that the benefit of the PLSA-based pruning presented here is not the ability to discover new and meaningful visual words for retrieval but the ability to recognize those visual words that convey most of the meaning among the ones present in the vocabulary. The meaningfulness transform, moreover, is able to improve the accuracy by increasing the relative weight of the most meaningful visual words.

5 Conclusions and Future Work

In this work, a vocabulary pruning and descriptor transformation method based on probabilistic latent semantic analysis of visual words for medical image retrieval and classification is presented. The selection of optimal visual words is performed by removing visual words whose conditional probability over all learnt latent topics is below a given threshold; the remaining (meaningful) words are weighted according to their largest conditional probability. The process is completely unsupervised, since the topics are learnt without taking into consideration the number of classes or the actual class assigned to each image. Therefore, it can be used to reduce massive fine-grained vocabularies to smaller vocabularies that contain only the most meaningful visual words, even before training the classifier. To obtain these fine-grained vocabularies, simple clustering algorithms can be used to produce a large number of small clusters that are later pruned using the methods explained in this chapter. Smaller clusters are expected to encode subtle visual differences among images, which will be preserved by the PLSA-based pruning if they are meaningful for some latent topic. Future applications of the technique also include the use of multiple vocabularies that can be merged and pruned as a single set of discrete features.

We are currently extending the techniques to images obtained for clinical use, where low-dimensional descriptors can enable fast and accurate characterization of large-scale datasets of high-dimensional (3D, 4D, multimodal) images. This is expected to lead to results that differ from those of the modality classification and retrieval tasks on the literature, as color plays a more important role in images from the literature than in most clinical images. Still, the possibility to strongly reduce visual vocabularies can allow for larger base vocabularies that potentially capture the image content much better and can then be reduced for efficient retrieval.