
1 Introduction

Image retrieval and image classification have been extremely active research domains, with hundreds of publications over the past 20 years [1–3]. Content-based image retrieval has been proposed for diagnosis aid, decision support and enabling similarity-based easy access to medical information [4, 5], ranging from similar cases to similar images and similar regions of interest.

One of the main domains of image retrieval has been the medical literature, with millions of images being available [6, 7]. ImageCLEFmed (the medical image retrieval task of the Cross Language Evaluation Forum) is an annual evaluation campaign on the retrieval of images from the biomedical open access literature [8]. In the ImageCLEF medical task, usually 12–17 teams have compared their approaches each year from 2004 to 2013, based on a variety of search tasks [9].

The Bag-of-Visual-Words (BoVW) is a visual description technique that aims at narrowing the semantic gap by partitioning a low-level feature space into regions that potentially correspond to visual topics. These regions are called visual words, in analogy to text-based retrieval and the bag-of-words approach. An image can be described by assigning a visual word to each of the feature vectors that describe local regions or patches of the image (sampled either on a dense grid or at interest points, often based on saliency), and then representing the set of feature vectors by a histogram of the visual words. One of the most interesting characteristics of the BoVW approach is that the set of visual words is created from the actual data, and therefore only topics present in the data are part of the visual vocabulary [10].

The creation of the vocabulary is normally based on a clustering method (e.g. k-means, DENCLUE) that identifies local clusters in the feature space, a visual word then being assigned to each cluster center. This has been investigated previously, either by searching for the optimal number of visual words [11], by using clustering algorithms other than k-means [12], or by selecting interest points to obtain the features [13].
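As a toy illustration of the two steps described above, the following sketch clusters a pool of local descriptors into a vocabulary with k-means and then encodes an image as a normalized histogram of visual words (scikit-learn and the vocabulary size of 500 are illustrative choices, not necessarily those used in our experiments):

    # Sketch of visual vocabulary creation and BoVW encoding; illustrative only.
    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(descriptors, n_words=500, seed=0):
        """Cluster pooled local descriptors; the cluster centers are the visual words."""
        return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(descriptors)

    def bovw_histogram(vocabulary, image_descriptors):
        """Assign each local descriptor to its nearest visual word and count."""
        words = vocabulary.predict(image_descriptors)
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / hist.sum()  # normalized histogram over the vocabulary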

Although the BoVW is widely used in the literature [14, 15], there is a strong performance variation within similar experiments when different vocabulary sizes are considered [11], making the choice of vocabulary size a crucial aspect of visual vocabularies. We hypothesize that this variance of the BoVW method is strongly related to the quality of the vocabulary used, understanding quality as the ability of the vocabulary to accurately describe concepts useful for the task. Therefore, we try to reduce the size of the vocabulary without reducing the performance of the method. The use of supervised clustering [16, 17] to force the clusters to a known number of classes was also considered as an option, but it runs against the notion of learning the variety of topics present in the data. Instead, we compute the latent semantic topics in the dataset in an unsupervised way by analyzing the probability of each word to occur. This makes it possible to extract concepts or topics from a combination of various visual word types, since the topics are discovered based on the probability of co-occurrence of a set of visual words regardless of their origin. The resulting reduced vocabularies present two benefits over the full ones. First, shorter descriptors reduce the computational cost of the online phase of retrieval as well as of the offline indexing phase. This reduction becomes important in the context of large-scale databases or big data. Second, by removing non-meaningful visual words, the dataset description becomes more compact. A compact representation makes it easier to use neighbourhood-based classifiers, which tend to fail in high-dimensional feature spaces due to the curse of dimensionality. Finally, a transformation of the descriptor is proposed that combines the pruning of meaningless visual words with weighting meaningful words according to their importance for the visual topics.

The rest of this chapter is organized as follows: Sect. 2 explains in detail the materials and methods used, with a focus on the data set, probabilistic latent semantic analysis and how it is used to remove meaningless visual words from the vocabulary. Section 3 presents the results of the experiments run on the data set, while Sect. 4 discusses them. Conclusions and future work are given in Sect. 5.

2 Materials and Methods

In this section, further details on the data set and the techniques employed are given.

2.1 Data Set

Image modality filters are one of the features of medical image retrieval that practitioners would like to see included in existing image search systems [18]. Medical image search engines such as GoldMiner and Yottalook contain modality filters to allow users to focus retrieval results. Whereas DICOM headers often contain metadata that can be used to filter modalities, this information is lost when images are exported for publication in journals or conferences, where they are stored as JPG, GIF or PNG files and are usually further processed (fewer grey levels, cropping, etc.). In this case, the visual appearance is key to identifying modalities, or the caption text can be analyzed for respective keywords. The ImageCLEFmed evaluation campaign contains a modality classification task that is regarded as an essential part of image retrieval systems. In 2012, the modality classification data set contained 2,000 images from the medical literature organized into a hierarchy of 31 categories [19]. Figure 1 shows the hierarchical structure of modalities. Each image in the dataset belongs to a single leaf node of the hierarchy.

Fig. 1 Hierarchy of modalities or image types considered in the modality classification task

The modality classification dataset is divided into two subsets of 1,000 images each, one for training and one for testing. The training set and its corresponding ground truth are made public for the groups to train and optimize their methods but the comparison is performed on a test set of which the ground truth is not known by the groups. Figure 2 shows the distribution of images across modalities in the training and test sets.

Fig. 2 Distribution of images across modalities for the modality classification training and test sets

Besides modality classification, an image retrieval task is also performed during the benchmarking event, where independent assessors judge the relevance of each document in the pool of results submitted by the groups. The retrieval task is performed on the full ImageCLEFmed data set, which in 2012 consisted of more than 306,000 images from the open access biomedical literature.

Both data sets were used in the experiments described in this chapter. Methods were first tested on the modality classification data set (training and testing) to investigate the effect of the parameters on the system. Then, fewer parameter combinations were tested on the retrieval task with the larger database.

2.2 Descriptors

In this section, the descriptors used in our experimental evaluation are presented. The Scale Invariant Feature Transform (SIFT) and the Bag-of-Colors (BoC) were chosen as image descriptors.

2.2.1 SIFT

In this work, images are described with a BoVW based on their SIFT [20] descriptors. This representation has been commonly used for image retrieval because it can be computed efficiently [15, 21, 22]. The SIFT descriptor is invariant to translations, rotations and scaling transformations, and robust to moderate perspective transformations and illumination variations. SIFT encodes the salient aspects of the grey-level image gradient in a local neighbourhood around each interest point.
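For illustration, such descriptors can be extracted, for example, with OpenCV; the chapter does not prescribe a particular implementation, so the following is a minimal sketch only:

    # Extract SIFT descriptors from a grey-level image with OpenCV (illustrative).
    import cv2

    def sift_descriptors(image_path):
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # SIFT uses grey levels
        sift = cv2.SIFT_create()
        _keypoints, descriptors = sift.detectAndCompute(img, None)
        return descriptors  # one 128-dimensional vector per interest point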

2.2.2 Bag of Colors

BoC is used to extract a color signature from the images [23]. The method is based on the BoVW image representation, which facilitates the fusion with the SIFT-BoVW descriptor. The CIELab color space was used since it is a perceptually uniform color space [24]. A color vocabulary \(\mathcal{C} =\{ c_{1},\ldots,c_{100}\}\), with \(c_{i} = (L_{i},a_{i},b_{i}) \in CIELab\), is defined by automatically clustering the most frequently occurring colors in the images of a subset of the collection containing an equal number of images from the various classes.

The BoC of an image I is defined as a vector \(BoC =\{\bar{ c}_{1},\ldots,\bar{c}_{100}\}\) such that, for each pixel \(p_{k} \in I\):

$$\displaystyle{\bar{c}_{i} =\sum _{k=1}^{P}g_{ i}(p_{k})}$$

with P the number of pixels in the image I, where

$$\displaystyle{ g_{i}(p) = \left \{\begin{array}{l} 1\ \mathrm{if}\ d(p,c_{i}) \leq d(p,c_{l})\ \forall l \in \{ 1,\ldots,100\}\\ 0\ \mathrm{otherwise}\\ \end{array} \right. }$$
(1)

and d(x, y) is the Euclidean distance between x and y.
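A direct translation of this definition could look as follows (a sketch: scikit-image is our choice for the CIELab conversion, and the 100-word color vocabulary is assumed to be precomputed as described above):

    # Bag-of-Colors: every pixel votes for its nearest CIELab color word.
    import numpy as np
    from skimage import color, io

    def bag_of_colors(image_path, color_words):
        """color_words: (100, 3) array of (L, a, b) color words c_1..c_100."""
        lab = color.rgb2lab(io.imread(image_path))   # H x W x 3 pixels in CIELab
        pixels = lab.reshape(-1, 3)                  # one row per pixel p_k
        # Euclidean distances d(p, c_j) from every pixel to every color word
        dists = np.linalg.norm(pixels[:, None, :] - color_words[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)               # g_j(p_k) = 1 only for the nearest c_j
        return np.bincount(nearest, minlength=len(color_words)).astype(float)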

2.3 Vocabulary Pruning and Descriptor Transformation Using Probabilistic Latent Semantic Analysis

In spoken or written language, not all words contain the same amount of information. Specifically, the grammatical class of a word is tightly linked to the amount of meaning it conveys. E.g. nouns and adjectives (open grammatical classes) can be considered more informative than prepositions and pronouns (closed grammatical classes).

Similarly, in a vocabulary of \(N_{W}\) visual words generated by clustering a feature space populated with training data, not all words are useful to describe the appearance of the visual instances.

From an information theoretical point of view, a bag of (visual) words containing \(L_{i}\) elements can be seen as \(L_{i}\) observations of a random variable W. The unpredictability or information content of the observation corresponding to the visual word \(w_{n}\) is

$$\displaystyle{ I(w_{n}) = log\left ( \frac{1} {P(W = w_{n})}\right ) }$$
(2)

This explains why nouns or adjectives contain, in general, more information than prepositions or pronouns. Words belonging to a closed class are more probable than those belonging to a much richer class. According to Eq. 2, information is related to unlikelihood of a word.
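A two-line numeric illustration of Eq. 2 (the occurrence counts are invented for the example):

    # Information content of words from (made-up) occurrence counts, Eq. (2).
    import numpy as np

    counts = np.array([5000.0, 120.0, 3.0])   # a frequent, a medium and a rare word
    p = counts / counts.sum()                 # P(W = w_n)
    information = np.log(1.0 / p)             # I(w_n); the rarest word scores highest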

In a bag of visual words scheme for visual understanding, it is important to use very specific words with high discriminative power. On the other hand, very specific words alone do not always make it possible to establish and recognize similarities. This can be done by establishing a concept that generalizes very specific words with similar meanings into a less specific visual topic. E.g., in order to recognize the similarity between the (specific) words bird and fish, we need a less specific topic such as animal.

A visual topic z is the representation of a generalized version of the visual appearance modeled by various visual words. It corresponds to an intermediate level between visual words and the complete understanding of visual information. A set of visual topics \(\mathcal{Z} = \left \{z_{1},\ldots,z_{N_{Z}}\right \}\) can be defined in a way that every visual word can belong to none, one or several visual topics, therefore establishing and possibly quantifying the relationships among words (see Fig. 3).

Fig. 3 Conceptual model of visual topics, words and features. Whereas continuous features are the most informative descriptors from an information theoretical point of view, visual words generalize feature points that are close in the feature space. We propose visual topics as a higher generalization level, modelling partially shared meanings among words

2.3.1 Probabilistic Latent Semantic Analysis

Visual words are often referred to as an extension of the bag-of-words technique used in information retrieval from textual to visual data. Similarly, language modelling techniques have also been extended from text to visual word-based techniques [25, 26].

Latent Semantic Analysis (LSA) [27] is a language modelling technique that maps documents to a vector space of reduced dimensionality, the latent semantic space, based on a Singular Value Decomposition (SVD) of the term-document co-occurrence matrix. The technique was later extended to a statistical model, called Probabilistic Latent Semantic Analysis (PLSA), by Hofmann [28]. PLSA removes restrictions of the earlier, purely algebraic approach (namely, the linearity of the mapping).

Hofmann defines a generative model in which the observed probability of a word or term \(w_{j},j \in 1,\ldots,M\), occurring in a given document \(d_{i},i \in 1,\ldots,N\), is linked to a latent or unobserved set of concepts or topics \(\mathcal{Z} =\{ z_{1},\ldots,z_{K}\}\) that occur in the text:

$$\displaystyle{ P(w_{j}\vert d_{i}) =\sum _{ k=1}^{K}P(w_{ j}\vert z_{k})P(z_{k}\vert d_{i}). }$$
(3)

The model is fit via the EM (Expectation-Maximization) algorithm. For the expectation step:

$$\displaystyle{ P(z_{k}\vert d_{i},w_{j}) = \frac{P(w_{j}\vert z_{k})P(z_{k}\vert d_{i})} {\sum _{l=1}^{K}P(w_{j}\vert z_{l})P(z_{l}\vert d_{i})}. }$$
(4)

and for the maximization step:

$$\displaystyle\begin{array}{rcl} P(w_{j}\vert z_{k}) = \frac{\sum _{i=1}^{N}n(d_{i},w_{j})P(z_{k}\vert d_{i},w_{j})} {\sum _{m=1}^{M}\sum _{i=1}^{N}n(d_{i},w_{m})P(z_{k}\vert d_{i},w_{m})},& &{}\end{array}$$
(5)
$$\displaystyle\begin{array}{rcl} P(z_{k}\vert d_{i}) = \frac{\sum _{j=1}^{M}n(d_{i},w_{j})P(z_{k}\vert d_{i},w_{j})} {n(d_{i})}.& &{}\end{array}$$
(6)

where \(n(d_{i},w_{j})\) denotes the number of times the term \(w_{j}\) occurred in document \(d_{i}\), and \(n(d_{i}) =\sum _{j}n(d_{i},w_{j})\) is the document length.

These steps are repeated until convergence or until a termination condition is met. As a result, two probability matrices are obtained: the word-topic probability matrix \(W_{M\times K} = (P(w_{j}\vert z_{k}))_{j,k}\) and the topic-document probability matrix \(D_{K\times N} = (P(z_{k}\vert d_{i}))_{k,i}\).
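A compact numpy sketch of this EM procedure, assuming a word-document count matrix as input, is shown below; it is illustrative only (a practical implementation would add a convergence criterion and avoid materializing the full N × M × K array for large collections):

    # PLSA fitted via EM (Eqs. 3-6) on a count matrix n_dw of shape (N, M).
    import numpy as np

    def plsa(n_dw, K, n_iter=100, seed=0):
        """Return the word-topic matrix P(w|z), (M, K), and topic-document matrix P(z|d), (K, N)."""
        rng = np.random.default_rng(seed)
        N, M = n_dw.shape
        p_w_z = rng.random((M, K)); p_w_z /= p_w_z.sum(axis=0)  # P(w_j | z_k)
        p_z_d = rng.random((K, N)); p_z_d /= p_z_d.sum(axis=0)  # P(z_k | d_i)
        for _ in range(n_iter):
            # E-step, Eq. (4): P(z_k | d_i, w_j), shape (N, M, K)
            joint = p_w_z[None, :, :] * p_z_d.T[:, None, :]
            p_z_dw = joint / joint.sum(axis=2, keepdims=True)
            # M-step, Eqs. (5) and (6)
            weighted = n_dw[:, :, None] * p_z_dw            # n(d_i, w_j) P(z_k | d_i, w_j)
            p_w_z = weighted.sum(axis=0)
            p_w_z /= p_w_z.sum(axis=0, keepdims=True)
            p_z_d = weighted.sum(axis=1).T
            p_z_d /= n_dw.sum(axis=1)[None, :]              # divide by n(d_i)
        return p_w_z, p_z_d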

2.3.2 PLSA for Visual Words

The PLSA technique only requires a word-document co-occurrence matrix and can therefore be called feature-agnostic. Since it sets no requirements on the nature of the low-level features that yield these co-occurrence matrices (other than being discrete), the extension to visual words is straightforward. PLSA in combination with visual words has also been applied for classification purposes in [29, 30].

In our approach, images are described in terms of a BoC in the CIELab color space and a BoVW based on SIFT descriptors. Therefore, the dataset can be described using the following co-occurrence matrices:

$$\displaystyle\begin{array}{rcl} C_{N\times N_{C}} = (n(d_{i},c_{j}))_{i,j},& &{}\end{array}$$
(7)
$$\displaystyle\begin{array}{rcl} S_{N\times N_{S}} = (n(d_{i},s_{l}))_{i,l},& &{}\end{array}$$
(8)

where N is the number of images in the dataset, \(N_{C}\) the length of the color vocabulary, \(N_{S}\) the length of the SIFT-based vocabulary, and \(n(d_{i},c_{j})\) (respectively \(n(d_{i},s_{l})\)) the number of occurrences of the color word \(c_{j}\) (respectively the SIFT word \(s_{l}\)) in the image \(d_{i}\).

2.3.3 Vocabulary Pruning

The key idea of our approach is that the color and SIFT vocabularies are not only individually over-complete and redundant for the dataset, but may also contain visual words that model the same latent topics. Therefore, a full color-SIFT representation of the dataset is obtained by concatenating the two matrices C and S into a single \(N \times (N_{C} + N_{S})\) visual feature matrix V.

The matrix V is then analysed using the PLSA technique with a varying number of topics K and the resulting visual word-topic conditional probability matrices \(W_{(N_{C}+N_{S})\times K}\) are used to find the meaningless visual words that need to be removed from the vocabulary.

A visual word is considered meaningless if its conditional probability is below the significance threshold \(T_{k}\) for every latent topic. Since each topic can be linked to a different number of visual words, the significance threshold is not an absolute value but relative to each topic. In our approach, \(T_{k}\) takes the value of the \(p_{T}\)-th percentile of each topic. This keeps only the \((100 - p_{T})\)% most significant visual words for each topic, while the remaining visual words are removed. A visual word can signify several topics (polysemic words) and several visual words can be equally significant for a given topic (synonyms). Because of these factors, which are common in language modelling, the vocabulary reduction cannot be estimated directly from the value of \(p_{T}\): it depends on the distribution of synonyms and polysemic words in the experimental data.
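The pruning step then reduces to a per-topic percentile test on the word-topic matrix; a sketch, assuming W is the \((N_{C}+N_{S})\times K\) matrix of \(P(w\vert z)\) values returned by PLSA on the concatenated matrix V:

    # Keep a visual word iff its P(w|z) reaches the p_T-th percentile of some topic.
    import numpy as np

    def prune_vocabulary(W, p_T):
        """W: (n_words, K) word-topic probabilities. Returns a keep-mask over words."""
        T = np.percentile(W, p_T, axis=0)        # one threshold T_k per topic
        return (W >= T[None, :]).any(axis=1)     # meaningful for at least one topic

    # Example usage: V = np.hstack([C, S]); W, _ = plsa(V, K=10)
    # mask = prune_vocabulary(W, p_T=80); V_reduced = V[:, mask]
    # (histograms are then recomputed/renormalized on the reduced vocabulary)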

The number of latent topics and the value of the significance percentile are parameters of the presented technique. Section 3 reports the experimental evaluation of the technique for various values of K and \(p_{T}\).

2.3.4 Meaningfulness-Based Descriptor Transformation

Instead of using a hard decision based on a meaningfulness threshold, a transformation can be defined that weights visual words according to their meaningfulness. The meaningfulness of a visual word \(w_{n}\) is its maximum topic-based significance level, with \(t_{n,j} = P(w_{n}\vert z_{j})\):

$$\displaystyle{m_{n} = \left \{\begin{array}{l} \max _{j}\left \{t_{n,j}\right \}\ \mathrm{if}\ \max _{j}\left \{t_{n,j}\right \} \geq T_{meaning} \\ 0\ \mathrm{otherwise}\\ \end{array} \right.}$$

Let h be a histogram vector where each component represents the multiplicity of a visual word, and M a meaningfulness transformation matrix:

$$\displaystyle\begin{array}{rcl} \mathbf{h}& =& (n(w_{1}),n(w_{2}),\ldots,n(w_{N_{W}}))^{T}{}\end{array}$$
(9)
$$\displaystyle\begin{array}{rcl} \mathbf{M}& =& \left (\begin{array}{cccc} m_{1} & 0 &\cdots & 0 \\ 0 &m_{2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 &\cdots &m_{N_{W}} \end{array} \right ){}\end{array}$$
(10)

Then, the vector \(\mathbf{h^{M}} = (n(w_{1}^{M}),n(w_{2}^{M}),\ldots,n(w_{N_{W}}^{M}))^{T}\) is the histogram vector of visual words in the meaningfulness-transformed space.

$$\displaystyle\begin{array}{rcl} \mathbf{h^{M}}& =& \mathbf{M}\mathbf{h}{}\end{array}$$
(11)
$$\displaystyle\begin{array}{rcl} n(w_{i}^{M})& =& m_{ i} \cdot n(w_{i}){}\end{array}$$
(12)
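Since M is diagonal, the transformation amounts to an element-wise re-weighting of the histogram; a short sketch, assuming W is the word-topic probability matrix from Sect. 2.3.2:

    # Meaningfulness transform, Eqs. (9)-(12): weight words by max topic probability.
    import numpy as np

    def meaningfulness_weights(W, T_meaning):
        m = W.max(axis=1)              # m_n = max_j t_{n,j}
        m[m < T_meaning] = 0.0         # words below the threshold are removed
        return m

    def transform_histogram(h, m):
        return m * h                   # h^M = M h, with M = diag(m_1, ..., m_{N_W})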

2.4 Experiments

Several experiments were run to evaluate the performance of the vocabulary pruning technique. In this section, the experiments are described.

2.4.1 Classification with a Truncated Descriptor

Preliminary experiments with the vocabulary pruning technique on the training set were based on removing meaningless visual words from the descriptors but not from the vocabulary (i.e., the histogram values of the remaining meaningful visual words stay the same and the histograms are therefore no longer normalized).

By running a twofold cross-validation on the modality classification training set, the effect of the parameters K (number of latent topics) and \(p_{T}\) (significant percentile threshold) was investigated. All descriptors were computed using the full vocabulary, and visual words below the significance threshold were later removed from the descriptors. No fusion rules were applied to the SIFT-BoVW and BoC descriptors.

2.4.2 Classification with a Reduced Vocabulary

In this experiment, meaningless visual words were removed from the vocabulary and the histograms were recomputed, therefore staying normalized. Due to the presence of very unbalanced classes in the dataset, the experiments included both a twofold cross-validation on the training set and an evaluation with separate training and test sets. The same experiments were run with the full vocabularies.

Classification using the SIFT-BoVW and BoC descriptors can benefit from a fusion technique that combines texture and color information. The similarity scores were calculated with the two descriptors separately, and the CombMNZ fusion rule [31] was used to obtain the final scores. Images were classified using weighted k-NN (k-nearest neighbors) voting [32]. Experiments were run with various values of k for the voting.
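A sketch of these two steps is given below; it reflects our reading of CombMNZ (min-max normalized scores summed and multiplied by the number of descriptors retrieving the image), and all names are illustrative:

    # CombMNZ fusion of per-descriptor similarity scores and weighted k-NN voting.
    def comb_mnz(score_lists):
        """score_lists: one {image_id: similarity} dict per descriptor (SIFT-BoVW, BoC)."""
        fused = {}
        for scores in score_lists:
            lo, hi = min(scores.values()), max(scores.values())
            for img, s in scores.items():
                norm = (s - lo) / (hi - lo) if hi > lo else 0.0
                total, hits = fused.get(img, (0.0, 0))
                fused[img] = (total + norm, hits + 1)
        return {img: total * hits for img, (total, hits) in fused.items()}

    def weighted_knn_label(neighbors, k):
        """neighbors: (label, fused_score) pairs sorted by decreasing score."""
        votes = {}
        for label, score in neighbors[:k]:
            votes[label] = votes.get(label, 0.0) + score  # score-weighted vote
        return max(votes, key=votes.get)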

2.4.3 Retrieval with a Reduced Vocabulary Over the Complete Data Set

In this experiment, the complete ImageCLEF dataset for medical images was indexed for retrieval. The number of images in the dataset (306,000) is sufficiently large to allow measuring the speed gain obtained when reducing the vocabulary. Retrieval was performed using the fusion rule described in Sect. 2.4.2. The retrieval experiment consisted of the 22 topics of the ImageCLEF 2012 medical track, each consisting of 1–7 query images.

2.4.4 Classification Using Descriptor Transformation

In order to assess the impact of the vocabulary size and the meaningfulness-based weighting of visual words, an experimental evaluation based on the SIFT description of the images (see Sect. 2.2.1) was performed.

Evaluation with separate training and test sets was performed using all combinations of the following parameters:

1. Two SIFT-based visual vocabularies with 100 and 500 visual words.

2. A varying number of visual topics, from 25 to 350 in steps of 25.

3. A varying meaningfulness threshold, from 50 to 100 %.

3 Results

In this section a summary of the results for each experiment is given.

3.1 Truncated Descriptor

This section presents the results of the experiment described in Sect. 2.4.1. Since the descriptor requires the full vocabulary before the truncation of meaningless words is performed, no speed gain was obtained in the offline phase.

Figure 4a shows the accuracy obtained using a 1-NN classifier; Fig. 4b shows the corresponding effect of truncating the descriptors on the vocabulary size. The number of latent topics K varies from 10 to 100 in steps of 10, and the significant percentile threshold for each topic, \(p_{T}\), from 1 to 99.

Fig. 4 Evaluation of descriptor truncation over the modality classification training set using cross-validation. 1-NN classification was performed for a varying number of latent topics K and significant percentile \(p_{T}\). (a) Effect on classification accuracy. (b) Effect on effective vocabulary size

The effect of increasing the significance percentile is much stronger on the number of visual words used than on the classification accuracy. Similarly, the number of latent topics has a limited impact on accuracy while having a strong impact on the vocabulary size. Rather unsurprisingly, the fewer latent topics considered, the easier it becomes to find meaningless visual words. Also, vocabulary sizes tend to be more similar across K values when \(p_{T}\) is high.

Statistical significance tests were run to compare the result distributions obtained with the truncated descriptors. These tests failed to show a statistically significant difference between classification using the full descriptor and classification using any of the reduced descriptors over the training set.

3.2 Reduced Vocabulary Over Modality Classification Training and Test Sets

This section contains a summary of the results of the experiments described in Sect. 2.4.2.

Table 1 contains a summary of the best results for a significant percentile \(p_{T} = 80\) and a varying number of topics. It also includes the results obtained with the full vocabulary using the same classifier. Although it is not shown in the table, all of the words removed for \(p_{T} = 80\) belonged to the SIFT-BoVW vocabulary.

Table 1 Best classification results (varying the k-NN voting) over the training set for a varying number of latent topics and a fixed significant percentile \(p_{T} = 80\)

Table 2 contains the corresponding results for the 99th percentile as significance threshold. In this experiment, meaningless words were found in both the BoC and the SIFT-BoVW vocabularies.

Table 2 Best classification results (varying the k-NN voting) over the training set for a varying number of latent topics and a fixed significant percentile \(p_{T} = 99\)

Tables 3 and 4 contain the corresponding results over the test set when performing cross-validation with separate test and training sets. The vocabularies used are the same as those from Tables 1 and 2.

Table 3 Best classification results (varying the k-NN voting) over the test set for a varying number of latent topics and a fixed significant percentile \(p_{T} = 80\)
Table 4 Best classification results (varying the k-NN voting) over the test set for a varying number of latent topics and a fixed significant percentile \(p_{T} = 99\)

3.3 Reduced Vocabulary for the Retrieval Task

Based on the results in Sect. 3.2, two vocabularies were selected for the ImageCLEFmed retrieval task: the smallest vocabulary, obtained with \(p_{T} = 99\) and 10 latent topics, and the most accurate vocabulary, obtained with \(p_{T} = 80\) and 10 latent topics.

Table 5 contains a summary of the time required for indexing the complete dataset with the most accurate configuration (\(p_{T} = 80\), 10 latent topics), the smallest vocabulary (\(p_{T} = 99\), 10 latent topics) and the complete vocabulary.

Table 5 Average indexing time per image for the smallest vocabulary, the most accurate vocabulary and the complete vocabulary

Table 6 shows the results when performing the retrieval task on the complete ImageCLEFmed 2012 dataset with the selected vocabularies for each of the 22 topics or queries.

Table 6 Results of retrieval experiments for each vocabulary

(b) Mean Average Precision (MAP) across all topics

  Vocabulary used             MAP (%)
  Complete vocabulary         6.51
  \(p_{T} = 80\), K = 10      6.52
  \(p_{T} = 99\), K = 10      1.51

(c) Average execution times of the online phase for a single query image

  Vocabulary used             Online retrieval time
  Complete vocabulary         125 s
  \(p_{T} = 80\), K = 10      107 s
  \(p_{T} = 99\), K = 10      45 s

3.4 Descriptor Transformation and Effect on Vocabulary Size

Using the parameters explained in Sect. 2.4.4 and applying the transformation proposed in Sect. 2.3.4, the effect of the initial vocabulary size and the meaningfulness threshold can be studied.

Figure 5 shows the effect of the transformation when using various meaningfulness thresholds on two vocabularies.

Fig. 5 Evaluation of descriptor transformation using the proposed meaningfulness transform over the modality classification task with separate training and test sets. 1-NN classification was performed for a varying number of latent topics and meaningfulness thresholds

4 Discussion

As shown in Fig. 4, PLSA-based pruning has a stronger effect on the size of the vocabulary than on the performance of the classifiers. Table 2 shows that a vocabulary reduction of up to 91.72 % can be obtained with comparable accuracy for the same classifier. For the 99th percentile, the best classification method with the reduced vocabulary always obtains a higher accuracy than the same classification method on the full vocabulary.

However, significance tests failed to show a statistically significant difference between the various accuracies obtained. Therefore, the main contribution of this work is a method that can enormously reduce visual word vocabularies while obtaining a comparable (and often slightly higher) accuracy.

Another important aspect of the results is that the PLSA-based pruning finds the BoC vocabulary to be more meaningful than the SIFT-BoVW one. Whereas in the complete vocabulary the SIFT-based words outnumbered the color words by a factor of 2.38, this relationship is inverted in the smallest vocabulary, where there are more than two color words for each SIFT-based word.

Results in Table 5 show that the reduction of the indexing time is smaller than the reduction in the number of words. Nevertheless, the smallest vocabulary yields an indexing time 55.9 % lower than that of the complete vocabulary. Studies have shown that reducing the number of features used as a descriptor can increase the speed of online retrieval [33]. This is confirmed in Table 6c, with retrieval times up to 64 % lower when using the smallest vocabulary.

Results in Tables 1–4 show that the performance is much better for the modality classification task than for retrieval in the complete ImageCLEFmed dataset (see Table 6), probably due to the size of the training set (1,000 images) in comparison with the 306,000 images in the complete dataset. For the retrieval task, the vocabularies show a comparable performance in terms of recall, with the \(p_{T} = 80\), K = 10 vocabulary being slightly better than the others. However, the mean average precision varies strongly between the large vocabularies and the smallest vocabulary (\(p_{T} = 99\), K = 10).

Evaluation of the proposed meaningfulness transformation shows an improvement in accuracy, together with the impact on vocabulary size already found for the PLSA-based pruning. The increase in accuracy is non-negligible and passes statistical significance tests. The accuracy is increased for both original vocabularies tested, and there is a slight saturation effect whereby the size of the descriptor can be reduced without impact on accuracy. Massive reductions of the descriptor, however, strongly reduce performance.

It can be argued that the benefit of the PLSA-based pruning presented here is not the ability to discover new and meaningful visual words for retrieval but the ability to recognize those visual words that convey most of the meaning among the ones present in the vocabulary. The meaningfulness transform, moreover, is able to improve the accuracy by increasing the relative weight of the most meaningful visual words.

5 Conclusions and Future Work

In this work, a vocabulary pruning and descriptor transformation method based on probabilistic latent semantic analysis of visual words for medical image retrieval and classification is presented. The selection of optimal visual words is performed by removing visual words whose conditional probability over all learnt latent topics is below a given threshold; the remaining (meaningful) words are weighted according to their largest conditional probability. The process is completely unsupervised, since the topics are learnt without taking into consideration the number of classes or the actual class assigned to each image. Therefore, it can be used to reduce massive fine-grained vocabularies to smaller vocabularies that contain only the most meaningful visual words, even before training the classifier. To obtain these fine-grained vocabularies, simple clustering algorithms can be used to produce a large number of small clusters that are later pruned using the methods explained in this chapter. Smaller clusters are expected to encode subtle visual differences among images, which will be preserved by the PLSA-based pruning if they are meaningful for some latent topic. Future applications of the technique also include the use of multiple vocabularies that can be merged and pruned as a single set of discrete features.

We are currently extending the techniques to images obtained for clinical use, where low-dimensional descriptors can enable fast and accurate characterization of large-scale datasets of high-dimensional (3D, 4D, multimodal) images. This is expected to lead to results that differ from those of the modality classification and retrieval tasks on the literature, as color plays a more important role in images from the literature than in most clinical images. Still, the possibility to strongly reduce visual vocabularies can allow for larger base vocabularies that potentially capture the image content much better and can then be reduced for efficient retrieval.