
1 Introduction

Over the last three decades, the healthcare sector has been dominated by medical imaging: physicians depend on various imaging modalities to diagnose and treat diseases. Different modalities are preferred for different organs and diseases; for example, X-rays suit the diagnosis of lung disease and bone fractures, CT is used for tumour detection in the head and abdomen, and ultrasound is used during pregnancy. With the continual introduction of new medical equipment, medical images are acquired at an ever-increasing rate. These bulk volumes of images are stored in centralized repositories and accessed frequently for diagnosis and study. Retrieving images from such large repositories is a difficult task, so an effective and efficient computerized retrieval system is required. Surveys show that modality is used as one of the filters to reduce the search space [1]. Hence, automated identification and classification of the modality of medical images has become an emerging research area.

To promote research in this area, the ImageCLEFmed forum has organized a modality classification contest since 2010 [1], and modality classification remained an important ImageCLEFmed task until 2013. The contest is evaluated by the percentage of correctly classified images. Many research groups registered for the contest and submitted promising results. Later tasks proposed by ImageCLEFmed in 2016, such as compound figure separation, also require the results of modality classification. Hence, research on modality classification continues and is being explored in multiple directions to improve on the results obtained so far.

This paper is organized as follows: Sect. 2 reviews the literature, Sect. 3 presents the proposed work, and Sect. 4 reports the experimental results. Section 5 concludes the paper and outlines directions for future work.

2 Related Work

Several research groups performed experiments on the modality classification task and submitted their results to the ImageCLEFmed2013 competition [2]. IBM, the research group that stood first in the modality classification task, adopted sophisticated multimodal fusion techniques and obtained 81.68% classification accuracy [3]. The FCSE group, ranked second, extracted densely sampled SIFT features and employed spatial pyramids [4]. The medGIFT group, ranked fourth, fused many feature descriptors, among them the colour and edge directivity descriptor (CEDD), bag of visual words (BoVW) using SIFT, and the fuzzy colour and texture histogram (FCTH) [5].

The fifth position was taken by the Image and Text Integration (ITI) group, which adopted flat and hierarchical classification strategies with SVM [6]. The best classification accuracy obtained by the DEMIR research team in the modality classification task was 64.60%, using a mixed approach that combined CEDD, FCTH and colour layout descriptor (CLD) features with textual information [7]. MIILab (Medical Image Information Laboratory) participated in the ImageCLEFmed2013 modality classification task and submitted results [8]: they extracted features using fast filtering techniques and SURFContext with the classical bag-of-features (BoF) approach, obtaining an overall classification accuracy of around 65%. Dimitrovski et al. [9] evaluated different combinations of visual and textual descriptors and obtained 87.10% accuracy, the best classification result reported so far. In [10], the authors extracted different visual and textual features and employed a strategy called joint kernel equal contribution (JKEC) to give equal weightage to all the features used. Kalpathy-Cramer et al. developed a neural-network-based hierarchical classifier and achieved greater than 95% classification accuracy on greyscale images [11].

Csurka et al. [12] used a Fisher vector representation of the images on the visual side and image captions on the textual side for classification. The authors in [13] represented images using BoVW, bag of colours (BoC), CEDD, FCTH and fuzzy colour histogram (FCH) descriptors. The survey thus shows that SIFT features are used in many modality classification approaches and yield among the best classification results. Hence, experiments are conducted here to optimize some parameters of the existing SIFT feature extraction and classification methodologies in order to further improve the overall accuracy.

3 Proposed Methodology

Given the frequent use of SIFT-based bag-of-visual-words (BoVW) representations of images, particularly for classification tasks, we extract dense SIFT features from each image and convert them to a BoVW histogram, but with some modifications to the parameters normally employed. The changes are based on contributions from the following three works:

  • Akata et al. [14] suggested several ways to improve classification accuracy on large-scale image sets; one suggestion is to use a sufficiently large number of training images.

  • Vedaldi and Zisserman [15], in their Image Classification Practical (2011), suggested including a spatial histogram to improve classification accuracy.

  • Swathi Rao [16] showed that the Hellinger kernel outperforms the linear kernel.

The proposed method combines the advantages of the three approaches above. The system consists of extracting densely sampled SIFT descriptors from a reasonably sized training set, including a spatial histogram in the bag-of-words representation of images, and comparing SVM classification results with linear and Hellinger kernel mappings. The architecture of the proposed system is illustrated in Fig. 1. Classifier performance is evaluated at three stages: stage I varies the number of training images, stage II adds the spatial histogram, and stage III compares the linear and Hellinger kernels.

Fig. 1 Architecture of the proposed system for modality classification

The various stages of the proposed system are discussed briefly in subsequent subsections.

3.1 Dense SIFT Feature Extraction

Bag-of-visual-words representations built from SIFT features are traditionally used in many classification problems. SIFT keypoints can be extracted in three modes: keypoint detection, dense sampling and random sampling. A SIFT keypoint represents a circle whose centre gives the x and y coordinates, whose radius gives the scale and whose angle gives the orientation.

To obtain keypoints at multiple scales, a Gaussian scale space is constructed: a collection of images obtained by progressively smoothing the input image. Such a scale space is shown in Fig. 2. Smoothing reduces the effective resolution of the image.

Fig. 2 Scale space of one image from the training set. With increasing scale, the resolution of the image decreases

Keypoints are then extracted at four different scales (sigma = 0.6, 1, 1.3 and 1.6 for the Gaussian filter) and sampled densely on an image grid with an interval of 4 pixels. For each keypoint, a 128-dimensional descriptor is obtained. To reduce the dimensionality, the descriptors are mapped to a codebook containing, say, 1000 codewords, and a histogram recording the proportion of descriptors assigned to each codeword is constructed.
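The sketch below illustrates this step in Python using opencv-python (cv2) and NumPy. It is a minimal approximation, not necessarily the exact implementation used in our experiments: the function name, the keypoint size of 8 × sigma and the grid construction are illustrative assumptions.

```python
# Minimal sketch of dense SIFT extraction at four scales (assumptions above).
import cv2
import numpy as np

def dense_sift(gray, step=4, sigmas=(0.6, 1.0, 1.3, 1.6)):
    """gray: 8-bit greyscale image. Returns an (N, 128) descriptor array."""
    sift = cv2.SIFT_create()
    all_desc = []
    h, w = gray.shape
    for sigma in sigmas:
        # One level of the Gaussian scale space (kernel size derived from sigma).
        smoothed = cv2.GaussianBlur(gray, (0, 0), sigma)
        # Keypoints on a regular grid with a 4-pixel interval; the keypoint
        # size (here 8 * sigma, an assumed value) sets the descriptor support.
        kps = [cv2.KeyPoint(float(x), float(y), 8.0 * sigma)
               for y in range(0, h, step)
               for x in range(0, w, step)]
        _, desc = sift.compute(smoothed, kps)
        all_desc.append(desc)
    return np.vstack(all_desc)   # one 128-dimensional descriptor per keypoint
```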

3.2 Bag of Visual Words

The origin of BoVW lies in regular text analysis. A text document can be interpreted as a collection of words, and to analyse the document we count the frequency of occurrence of those words. Similarly, an image can be interpreted as a collection of visual words, and to analyse the image we count the frequency of occurrence of those visual words.

Among the three modes of SIFT feature extraction, dense sampling provides the most keypoints, since features are extracted over the whole image grid with an interval of typically 2–4 pixels. Hence, many more features are obtained with this approach than with the other two modes, keypoint detection and random sampling. To reduce the descriptor set to a manageable size, the descriptors are quantized by simply running k-means on them; the k-means centroids represent the visual words of the image.

The various steps in forming the visual words of an image are as follows (a code sketch of steps 2–4 is given after the list):

  1. Dense SIFT features are extracted from the training images.

  2. Each feature is described by a 128-dimensional descriptor. k-means with, say, 1000 centroids is run on the obtained SIFT descriptors to end up with 1000 visual words.

  3. To represent a particular image using this visual vocabulary, dense features are again extracted from the image and assigned to visual words. The assignment is based on the Euclidean (L2) distance between a word and a given descriptor.

  4. Finally, a histogram of visual words is built to represent that particular image.
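A minimal Python sketch of steps 2–4 follows, using scikit-learn and NumPy; the function names and the use of MiniBatchKMeans (chosen here purely for speed over plain KMeans) are our assumptions.

```python
# Sketch of codebook construction (step 2) and histogram encoding (steps 3-4).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(train_descriptors, n_words=1000, seed=0):
    # train_descriptors: all dense SIFT descriptors of the training set, (N, 128).
    km = MiniBatchKMeans(n_clusters=n_words, random_state=seed, n_init=3)
    km.fit(train_descriptors)
    return km                      # the 1000 centroids are the visual words

def bovw_histogram(image_descriptors, codebook):
    # Nearest-centroid (Euclidean) assignment of each descriptor to a word.
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)   # L1-normalised word histogram
```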

The procedure for representing BoVW histogram for one image is visually summarized in Fig. 3.

Fig. 3
figure 3

Formation of BoVW histogram

3.3 Bag of Visual Words with Spatial Information

Another way to improve classification accuracy is to incorporate spatial information into the plain 1000-word BoVW histogram. To achieve this, the image is divided into 2 × 2 subregions and a histogram is computed for each subregion. The four 1000-word histograms are then stacked into a single vector of size 4000 (1000 × 4). Figure 4 shows the partition of an image into 2 × 2 subregions, and a code sketch follows.

Fig. 4 BoVW with 2 × 2 spatial tiling on the left and its histogram for each tile on the right
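A minimal sketch of the spatial tiling in Python: keypoint_xy holds the (x, y) grid positions of the dense keypoints and words their visual-word indices, both as NumPy arrays; the helper name is ours.

```python
# Sketch of the 2 x 2 spatial histogram: one 1000-bin histogram per quadrant,
# stacked into a single 4000-dimensional vector.
import numpy as np

def spatial_bovw(keypoint_xy, words, img_shape, n_words=1000):
    h, w = img_shape
    hists = []
    for row in range(2):
        for col in range(2):
            # Keypoints whose coordinates fall inside this quadrant.
            in_tile = ((keypoint_xy[:, 1] // (h / 2) == row) &
                       (keypoint_xy[:, 0] // (w / 2) == col))
            hists.append(np.bincount(words[in_tile],
                                     minlength=n_words).astype(float))
    stacked = np.concatenate(hists)        # shape (4000,)
    return stacked / max(stacked.sum(), 1.0)
```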

3.4 SVM Classifier with Linear and Hellinger Kernel

The support vector machine (SVM), introduced by Boser, Guyon and Vapnik in 1992, is used as the classifier, together with the kernel trick, to maximize the margin of the separating hyperplane [17]. The algorithm plots the features in feature space and identifies the boundary of each class with a hyperplane; the kernel trick is employed to find the hyperplane that best separates the classes. Two SVM classifiers are used, one with a linear kernel and one with a Hellinger kernel. The Hellinger kernel is implemented by taking the square root of the histogram.

To classify images of multiple classes, two flavours of SVM can be used: one-versus-one and one-versus-all. We prefer the one-versus-all approach, in which a classifier is built for each modality/class: examples of that class are assigned positive labels and all remaining examples negative labels. SVMs with linear and Hellinger kernel mappings are used; a sketch of both variants follows.
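The sketch below shows both variants with scikit-learn; the function names are ours. The Hellinger kernel is realised as an explicit feature map: since K(x, y) = Σ √(xᵢ yᵢ) = ⟨√x, √y⟩, taking the element-wise square root of an L1-normalised histogram and training a linear SVM is equivalent to using the Hellinger kernel. LinearSVC applies the one-versus-rest (one-versus-all) scheme internally.

```python
# Sketch of the linear vs. Hellinger SVM variants (one-vs-all multiclass).
import numpy as np
from sklearn.svm import LinearSVC

def train_svm(train_hists, labels, kernel="hellinger"):
    X = np.asarray(train_hists, dtype=float)
    if kernel == "hellinger":
        X = np.sqrt(X)            # explicit Hellinger feature map
    clf = LinearSVC(C=1.0)        # one-vs-rest over the six modalities
    clf.fit(X, labels)
    return clf

def predict_svm(clf, test_hists, kernel="hellinger"):
    X = np.asarray(test_hists, dtype=float)
    if kernel == "hellinger":
        X = np.sqrt(X)            # the same mapping must be applied at test time
    return clf.predict(X)
```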

The one-vs-all SVM classifier classifies the feature vector as positive or negative using Eq. 3.1:

$$\begin{aligned} & w^{T} x + b \ge 0 \quad \text{for positive classification} \\ & w^{T} x + b < 0 \quad \text{for negative classification} \end{aligned}$$
(3.1)

where x, w and b are the feature vector to be classified, the weight vector and the bias, respectively. The values of w and b are determined during training, and the resulting decision hyperplane classifies images as positive or negative. The crucial aspect is to find a set of weights and a bias such that the margin is maximized; kernel tricks are employed to make the data linearly separable and obtain the best margin.

4 Results and Discussion

Data set

The experiments are carried out on 780 images of six different modalities. The data set is split evenly, with 50% of the images used for training and 50% for testing. Table 1 gives the detailed split into training and testing sets.

Table 1 Training and testing samples for each class

The images are collected from the open-i biomedical image search engine, filtered by image modality, and from PubMed collections [18]. Examples of images from the training data set are shown in Fig. 5.

Fig. 5 Sample image from each modality/class in the training set

The collected images are of different sizes; each is resized so that it does not exceed 480 pixels in the row dimension, with the column dimension adjusted automatically so that the aspect ratio is preserved (a sketch of this step is given below). In all experiments, densely sampled SIFT features are extracted over the whole image grid with an interval of 4 pixels at 4 scales (sigma = 0.6, 1, 1.3 and 1.6), and k-means with 1000 centroids is then applied to the extracted features.
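A minimal sketch of the resizing step with opencv-python; the function name and the choice of INTER_AREA interpolation are our assumptions.

```python
# Cap the row count at 480 pixels, scaling the columns to keep the aspect ratio.
import cv2

def resize_max_rows(img, max_rows=480):
    h, w = img.shape[:2]
    if h <= max_rows:
        return img                         # already within the limit
    scale = max_rows / h
    # cv2.resize expects the target size as (width, height).
    return cv2.resize(img, (int(round(w * scale)), max_rows),
                      interpolation=cv2.INTER_AREA)
```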

Next, the image is partitioned into 2 × 2 subregions and the histogram is computed separately for each subregion.

The visual words of a training image from each modality are formed and their histograms constructed as shown in Fig. 6.

Fig. 6 Sample histogram from each modality/class in the training set

This histogram is the signature of the image, and because it is distinctive for each modality, the classifiers are trained on these histograms. The one-vs-all SVM classifiers for all modalities are tested on all test images with two variants of the SVM classifier: linear and Hellinger kernels.

The proposed system is evaluated by the overall classification accuracy: the ratio of the number of correctly classified images to the total number of images. This is the evaluation strategy commonly used for classification problems.

The results are tabulated as confusion matrices for the test set; the main diagonal gives the number of correctly classified images. A sketch of how such a matrix is computed follows.
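A minimal sketch with scikit-learn; the toy labels below are illustrative only, not the actual predictions of any run.

```python
# Build a confusion matrix; cm[i, j] counts images of class i predicted as j,
# so the main diagonal holds the correctly classified images.
from sklearn.metrics import confusion_matrix

classes = ["XR", "US", "PET", "PX", "CT", "MR"]   # the six modalities
y_true = ["XR", "XR", "CT", "MR", "US", "PET"]    # ground truth (toy example)
y_pred = ["XR", "CT", "CT", "MR", "US", "CT"]     # classifier output (toy example)
cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)
```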

Evaluation with the same training and testing split was performed for the following choices:

  1. Varying number of training images.

  2. BoVW histogram versus BoVW spatial histogram.

  3. SVM with linear versus Hellinger kernel.

In this section, the results of the proposed system for automatic classification of medical imaging modalities are reported. Six runs are performed for the modality classification task, and the classification result of each run is shown as a confusion matrix.

  • Run 1: SVM with linear kernel considering 10% of training images and 2 × 2 spatial histogram.

  • Run 2: SVM with linear kernel considering 50% of training images and 2 × 2 spatial histogram.

  • Run 3: SVM with linear kernel considering 100% of training images and 2 × 2 spatial histogram.

  • Run 4: SVM with linear kernel considering 100% of training images and histogram without spatial information.

  • Run 5: SVM with Hellinger kernel considering 100% of training images and histogram without spatial information.

  • Run 6: SVM with Hellinger kernel considering 100% of training images and spatial histogram.

For all the above runs, the following metrics are calculated:

$$\text{Accuracy} = \frac{\text{Number of correctly classified images}}{\text{Total number of images of that ground-truth class}}$$
(4.1)
$$\text{Reliability} = \frac{\text{Number of correctly classified images}}{\text{Number of images assigned to that class}}$$
(4.2)
$$\text{Overall accuracy} = \frac{\text{Total number of correctly classified images}}{\text{Total number of test images}}$$
(4.3)

Kappa is another metric used to evaluate the classifiers. It compares the observed accuracy with the accuracy expected from a random classifier and is calculated as

$$\text{Kappa} = \frac{\text{observed accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}}$$
(4.4)
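A minimal NumPy sketch computing Eqs. 4.1–4.4 from a confusion matrix whose rows are ground-truth classes and whose columns are predicted classes; the expected accuracy is derived from the row and column marginals, as in Cohen's kappa, and the function name is ours.

```python
# Per-class accuracy (4.1) and reliability (4.2), overall accuracy (4.3)
# and kappa (4.4) from a confusion matrix. Assumes every class occurs in
# the ground truth and is predicted at least once (no zero marginals).
import numpy as np

def classification_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    accuracy = np.diag(cm) / cm.sum(axis=1)       # Eq. 4.1, per class
    reliability = np.diag(cm) / cm.sum(axis=0)    # Eq. 4.2, per class
    overall = np.diag(cm).sum() / total           # Eq. 4.3
    # Accuracy expected from a random classifier with the same marginals.
    expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2
    kappa = (overall - expected) / (1.0 - expected)   # Eq. 4.4
    return accuracy, reliability, overall, kappa
```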

Each modality classifier is trained with 65 images per modality, and the entire test set, consisting of 65 images of each modality, is given to all the classifiers to classify the corresponding modality images. The confusion matrices for the six runs are tabulated in Tables 2, 3, 4, 5, 6 and 7, and the metrics of Eqs. 4.1–4.4 are calculated to assess classifier performance.

Table 2 Confusion matrix of Run 1—SVM with linear kernel, 10% of training images and 2 × 2 spatial histogram
Table 3 Confusion matrix of Run 2—SVM with linear kernel, 50% of training images and 2 × 2 spatial histogram
Table 4 Confusion matrix of Run 3—SVM with linear kernel, 100% of training images and 2 × 2 spatial histogram
Table 5 Confusion matrix of Run 4—SVM with linear kernel, 100% of training images and simple histogram
Table 6 Confusion matrix of Run 5—SVM with Hellinger kernel, 100% of training images and simple histogram

The overall classification accuracy and kappa for the six runs are tabulated in Table 8. According to Fleiss, kappa > 0.75 indicates an excellent classifier, 0.40–0.75 fair to good, and < 0.40 poor [19].

Table 7 Confusion matrix of Run 6—SVM with Hellinger kernel, 100% of training images and spatial histogram

Table 8 shows that both the classification accuracy and kappa increase over the six consecutive runs. The sixth run, combining SVM with the Hellinger kernel, the spatial histogram and 100% of the training images, gives the best classification accuracy of 73.077% and kappa of 0.677. Although the classifier cannot be rated excellent, it is fair to good according to Fleiss. The overall classification accuracy and kappa for the different runs are plotted in Figs. 7 and 8, respectively.

Table 8 Overall classification accuracy and kappa
Fig. 7 Overall classification accuracy plot of the six runs. The sixth run, combining the SVM classifier with the Hellinger kernel, the spatial histogram and 100% of the training images, gives the best classification accuracy of 73.077%

Fig. 8 Kappa plot of the six runs. The sixth run, combining the SVM classifier with the Hellinger kernel, the spatial histogram and 100% of the training images, gives the best value of 0.677. Kappa in the range 0.40–0.75 is rated fair to good

Table 9 compares the proposed work with the results submitted by the research groups to the ImageCLEF 2013 modality classification task.

Table 9 Comparison of proposed work with existing research works submitted in ImageCLEF 2013 modality classification task

The output of the best run for each class is shown in Fig. 9. Some images are misclassified in each class. The XR and US classifiers perform much better than the other modality classifiers. One reason is that PET mostly appears in combination with CT and is therefore misclassified as CT. As the visual similarities among CT, MR and PET are confusing even for humans, the system mispredicts many images from these groups. Hence, further parameter tuning is required to improve the classification accuracy. If the training set were strengthened with images similar to those that are misclassified, the accuracy could improve further, but this approach must not lead to overfitting. A deeper analysis of the misclassified images should therefore be carried out, and parameter changes considered from multiple viewpoints.

Fig. 9 a Output of XR classifier. b Output of US classifier. c Output of PET classifier. d Output of PX classifier. e Output of CT classifier. f Output of MR classifier

5 Conclusion

Experimental results are reported for the proposed system for classifying the modalities of medical images. The work is mainly intended for integration into a medical image retrieval system in which medical images are retrieved by modality. Using a data set of 780 images, six approaches are evaluated; the approach combining densely sampled SIFT descriptors, a bag-of-words spatial histogram and the Hellinger kernel mapping of SVM gives the best overall classification accuracy, 73.077%. The experiments show that increasing the number of training images, incorporating the spatial histogram and replacing the linear kernel with the Hellinger kernel all improve the results. As an extension of this work, we plan to tune further parameters in the future to improve the classification results.