1 Introduction

Melanoma is one of the most common types of cancer in Europe, North America, and Australia. Due to its rapid growth, it can metastasize to other organs, such as the lungs, bones, or brain [1]. The diagnosis of skin lesions follows a specific guideline: (i) inspection of the lesion using a magnification device; (ii) assessment of different criteria, such as the ABCD rule [2] or the 7-point checklist [3]; and (iii) scoring of the lesion based on the identified criteria. Although these medical rules are well established and improve diagnostic accuracy, the evaluation still hinges critically on visual inspection and on the expertise of the dermatologist [1]. This makes the analysis of lesions a highly subjective and difficult task.

Modern inspection devices can acquire images of the lesions with or without special illumination, which divides the images into two types: dermoscopy and clinical. For the past two decades, research groups have been working on computer-aided diagnosis (CAD) systems to diagnose skin lesions using either dermoscopy or clinical images [4, 5]. These CAD systems can be used as a support tool by dermatologists of any level of expertise, reducing the subjectivity of the diagnosis. This work uses dermoscopy images; thus, from this point on we discuss specific aspects of their automatic analysis.

CAD systems follow three main steps: (i) lesion segmentation; (ii) feature extraction; and (iii) lesion diagnosis [4]. Aside from lesion segmentation, which is by itself a major challenge, there is a significant diversity in the types of features and classifiers used in steps (ii) and (iii). A short list of classifiers that have been applied includes K-nearest neighbor, AdaBoost, support vector machines, and neural networks [5]. The extracted features can be divided into two categories: global and local. The former consists of computing a single vector to describe the entire lesion. This vector can comprise information about the shape and symmetry of the lesion (e.g., area, circularity measure, shape symmetry), color (e.g., RGB or HSV histograms), and texture (e.g., gray-level co-occurrence matrix) [6]. Local features allow us to separately characterize different regions of the lesion. This can be seen as an approximation of the analysis performed by dermatologists, since they also assess different regions of the lesions. A simple strategy to compute local features is the bag-of-features (BoF) approach, which has been applied with success in different works (e.g., [7, 8]). More recently, a different method has been used to obtain local features: sparse coding (SC) [9, 10]. This strategy arises from relaxing the restrictive constraints of the BoF optimization problem, as will be discussed in Sect. 2, and has been shown to be effective at capturing salient properties of images in different computer vision problems (e.g., [11, 12]).

Both BoF and SC have achieved promising classification results in dermoscopy image analysis. However, a direct comparison between the two types of features has been missing. In this paper, we fill this gap by comparing the two methods and assessing which one performs best. The remaining sections of the paper are organized as follows. In Sect. 2, we discuss and compare the formulations of BoF and SC. In Sect. 3 we describe the experimental framework, and in Sect. 4 we present the results.

2 Local Features - From BoF to Sparse Coding

The BoF method assumes that an image can be represented as a collection of elements of a dictionary of visual words (atoms). Assuming that a dictionary D of K elements is known, any image is processed as follows: (i) a set of M patches is extracted and a feature vector \(x_{m} \in \mathbb {R}^{D}\) is computed for each of them; (ii) the features are matched to the closest dictionary element, as follows

$$\begin{aligned} \min _{\alpha _{m}}\Vert x_{m}-D\alpha _{m}\Vert _{2}^{2} \nonumber \\ \mathbf{s.t.} \;\;\; \alpha _{m}\in \{0,1\}^{K}, \; \Vert \alpha _m \Vert _0=1 , \end{aligned}$$
(1)

where \(\Vert \cdot \Vert _0\) denotes the \(\ell _0\) "norm"; (iii) this information is summarized into a histogram of occurrences that counts the number of times each atom was selected.
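
To make the BoF encoding concrete, the following NumPy sketch (illustrative only, not the implementation used in this paper) performs the nearest-atom assignment of (1) and builds the histogram of occurrences; it assumes the M patch descriptors are stacked as the rows of `X` and the K atoms as the rows of `atoms` (i.e., the transpose of D):

```python
import numpy as np

def bof_encode(X, atoms):
    """Hard-assign each patch descriptor to its nearest atom and
    return the normalized histogram of atom occurrences.

    X     : (M, d) array, one patch descriptor per row.
    atoms : (K, d) array, one dictionary atom per row.
    """
    # Squared Euclidean distance between every descriptor and every atom.
    d2 = ((X[:, None, :] - atoms[None, :, :]) ** 2).sum(axis=2)   # (M, K)
    assignments = d2.argmin(axis=1)                               # per-patch solution of (1)
    hist = np.bincount(assignments, minlength=atoms.shape[0]).astype(float)
    return hist / hist.sum()                                      # histogram of occurrences
```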

The constraint used in (1) is very restrictive. In order to deal with this issue one can use the SC formulation. Similarly to BoF, the first step of SC is the extraction of a set of M image patches, followed by the computation of a feature vector to characterize each patch. The following step is to match the feature vectors to atoms of a known dictionary. However, instead of assuming that each vector is only associated with one of the atoms, SC assumes that each vector is a combination of a small number of atoms. This can be formulated as an optimization problem with a regularization based on the \(\ell _1\) norm

$$\begin{aligned} \min _{\alpha _{m}}||x_m-D\alpha _m||_{2}^2 + \lambda ||\alpha _m||_{1} , \end{aligned}$$
(2)

where \(\alpha _{m} \in \mathbb {R}^{K} \) is a vector of coefficients and \(\lambda \) is a non-negative, user-specified parameter that controls the weight of the regularization term. Using the \(\ell _1\) norm in the regularization term encourages sparsity of the coefficients, i.e., only a small number of them are non-zero. Additional constraints can be added to the problem, such as setting \(\alpha _m \ge 0\) [12].
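
Problem (2) is a lasso problem solved independently for each patch. As a minimal sketch (not the solver used in this paper), it can be solved with iterative soft-thresholding (ISTA); the code below assumes the atoms are stored as the columns of `D`:

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(x, D, lam, n_iter=200):
    """Approximately solve (2) for a single patch descriptor x with ISTA.

    x : (d,) descriptor, D : (d, K) dictionary with atoms as columns,
    lam : the regularization weight lambda in (2).
    """
    L = 2.0 * np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ alpha - x)       # gradient of the quadratic term
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha
```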

Aside from the patch representation, another main difference between BoF and SC is the strategy used to estimate the dictionaries. In both cases these are estimated using a training set of N feature vectors \(\{x_{1},...,x_{N}\} \in \mathbb {R}^{D}\), extracted from the patches of several images. According to the BoF formulation (see (1)) D can be estimated as follows

$$\begin{aligned} \min _{\alpha _{1},...,\alpha _{N}, D}\sum _{n=1}^{N} \Vert x_{n}-D\alpha _{n}\Vert _{2}^{2} \nonumber \\ \mathbf{s.t.} \;\;\; \alpha _{n}\in \{0,1\}^{K}, \; \Vert \alpha _n \Vert _0=1, \; \forall n . \end{aligned}$$
(3)

This optimization problem can be solved using a clustering algorithm, such as k-means [13].
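
As a minimal sketch of this step (using scikit-learn's k-means, which is not necessarily the implementation used in this paper), the BoF dictionary of (3) can be learned from the training descriptors as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_bof_dictionary(X_train, K, seed=0):
    """Learn a BoF dictionary, as in (3), via k-means clustering.

    X_train : (N, d) array of training patch descriptors.
    Returns a (K, d) array whose rows are the visual words (atoms).
    """
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X_train)
    return km.cluster_centers_
```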

In the SC formulation, a dictionary of K elements is obtained by solving the following optimization problem

$$\begin{aligned} \min _{\alpha _{1},...,\alpha _{N},D} \sum _{n=1}^{N} \left( ||x_{n}-D\alpha _{n}||_{2}^2 + \lambda ||\alpha _{n}||_{1} \right) \nonumber \\ \mathbf{s.t.} \;\;\; \Vert d_{k}\Vert _{2} \le 1,\; k=1,...,K, \end{aligned}$$
(4)

where \(d_k \in \mathbb {R}^{D}\) is the k-th column of D. The normalization constraint \(\Vert d_k \Vert _2 \le 1\) is used to avoid trivial solutions for the dictionary, namely having the columns of D grow to infinity while the \(\alpha \) coefficients approach zero.

The estimation of \(\alpha \) according to (2) is a convex problem, which can be solved using one of several special-purpose algorithms that have been developed for this problem [14]. In contrast, problem (4) is not convex and has been the focus of a considerable amount of recent research. The standard approach to solve (4) is to alternate between estimating the SC coefficients with the dictionary fixed and updating the dictionary [14]. Formally:

  1. (i)

    Fix the dictionary D and solve

    $$\begin{aligned} \min _{\alpha _{1},...,\alpha _{N}} \sum _{n=1}^{N} \left( ||x_{n}-D\alpha _{n}||_{2}^2 + \lambda ||\alpha _{n}||_{1} \right) . \end{aligned}$$
    (5)
  2. (ii)

    Fix \(\alpha _{1},...,\alpha _{N}\) and solve

    $$\begin{aligned} \min _{D} \sum _{n=1}^{N} ||x_{n}-D\alpha _{n}||_{2}^2,\;\; \mathbf{s.t.} \;\Vert d_{k}\Vert _{2} \le 1,\; k=1,...,K . \end{aligned}$$
    (6)

These two steps are repeated for a predefined number of iterations or until some convergence criterion is satisfied. An extensive review of methods to solve these optimization problems can be found in [14].
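
The sketch below illustrates this alternation; it reuses `sparse_code_ista` from the earlier sketch for step (5), and uses a simple least-squares update followed by projection onto the unit \(\ell _2\) ball for step (6). It is a simplified batch stand-in for illustration, not the online dictionary learning method [15] used later in this paper:

```python
import numpy as np

def learn_sc_dictionary(X, K, lam, n_outer=20, seed=0):
    """Alternate between the coding step (5) and the dictionary update (6).

    X : (d, N) matrix whose columns are the training descriptors.
    Returns D : (d, K) dictionary whose columns satisfy ||d_k||_2 <= 1.
    """
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], K))
    D /= np.linalg.norm(D, axis=0)                       # start with unit-norm atoms
    for _ in range(n_outer):
        # (i) sparse coding with D fixed (sparse_code_ista defined above).
        A = np.column_stack([sparse_code_ista(x, D, lam) for x in X.T])
        # (ii) dictionary update with the codes fixed: least squares ...
        D = X @ A.T @ np.linalg.pinv(A @ A.T)
        # ... followed by projection of each column onto the unit l2 ball.
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D
```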

A final difference between the two methods resides in how the final image representation is obtained. As stated at the beginning of this section, BoF represents the image as a histogram of occurrences of atoms. The same approach cannot be directly applied to SC, since the vectors \(\alpha _{m}\) select more than one atom, with different weights. Different pooling strategies have been proposed to tackle this issue, e.g., max-pooling or the mean of the absolute values of \(\alpha \) [11].
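
As a small illustrative sketch, given the per-patch codes of one image stacked as the rows of a matrix `A`, the two pooling strategies mentioned above (defined in (7) and (8) of Sect. 3) can be computed as follows:

```python
import numpy as np

def pool_codes(A, mode="max"):
    """Pool the per-patch sparse codes of one image into a single descriptor.

    A : (M, K) array, one sparse code alpha_m per row.
    """
    if mode == "max":                  # max-pooling of absolute values
        return np.abs(A).max(axis=0)
    if mode == "abs":                  # mean of absolute values
        return np.abs(A).mean(axis=0)
    raise ValueError("mode must be 'max' or 'abs'")
```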

3 Experimental Framework

The goal of this paper is to perform a fair comparison between the BoF and SC representations. Therefore, we keep the parameters common to both methods constant, such as the type and size of the patches extracted from the images and the features used to describe them, and adjust only what is specific to each method. In the sequel, we present the experimental setting used to obtain our results.

  1. (i)

    Patch extraction/image sampling: \(16\times 16\) overlapping patches (step of 8 pixels) are extracted from all of the images. Although it is possible to extract patches from the entire image (e.g., [9]), we chose to extract patches only from a bounding box around the lesion. This allows working with the images in their original size (average \(560 \times 750\) pixels) without prohibitive computational costs. The area of the image containing the lesion is identified using manual segmentation.

  2. (ii)

    Patch features: Color and texture features are computed for each patch, namely color histograms for the RGB and HSV color spaces, and gradient histograms (amplitude and orientation). All these histograms have 16 bins.

    The aforementioned features are not the ones used in other sparsity-based dermoscopy works [9, 10]. Those works use the vectorized patches in either gray-level or RGB space and learn dictionaries to represent that information. Nowadays, learning the dictionaries directly from image patches is a very popular approach [14]. Raw image patches are not suitable for the BoF framework, because the size of the resulting feature vectors leads to very high computational costs. Nonetheless, we will use image patches as features in SC, in order to establish a comparison with related works.

  3. (iii)

    Dictionary learning: k-means was used to obtain the BoF dictionaries, while the online dictionary learning method [15] (available in the SPAMS software package) was used to obtain the SC dictionaries. The size of all the dictionaries was chosen in the set \(K \in \{2^{7},2^{8},2^{9}\}\).

  4. (iv)

    Pooling: The final BoF descriptor was obtained using the traditional vector quantization approach (see (1)), followed by histogram building.

    In the case of SC, the \(\alpha \) vectors of the patches were obtained using the LARS algorithm [16] (available in SPAMS). Two optimization problems were considered in this phase: the traditional one (2) and the one obtained by adding a non-negativity constraint. The combination of all of the \(\alpha _{m}\) vectors of a given image is performed using two strategies: max-pooling (max) and average absolute pooling (abs), respectively defined as follows:

    $$\begin{aligned} \alpha ^{j} = \max \{|\alpha _{1}^{j}|, |\alpha _{2}^{j}|,...,|\alpha _{M}^{j}|\} , \end{aligned}$$
    (7)
    $$\begin{aligned} \alpha ^{j} = \frac{1}{M}\sum _{m=1}^{M}|\alpha _{m}^{j}| , \end{aligned}$$
    (8)

    where \(\alpha ^{j}\) is the j-th component of the vector \(\alpha \), M is the number of patches, and \(\alpha _{m}^{j}\) is the j-th component of the m-th patch vector.

  5. (v)

    Classification: The diagnosis was obtained using an SVM with a radial basis function (RBF) kernel (available in MATLAB 2015b®). A different classifier was trained for each of the possible feature configurations, using a set of dermoscopy images diagnosed by experts. In each of the experiments we tuned the width of the RBF kernel \(\rho \in \{2^{-12},2^{-5},...,2^{12}\}\) and the penalty term \(C \in \{2^{-6},2^{-4},...,2^{6}\}\) of the soft margin. An illustrative sketch of this step is given below.
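
The sketch below mirrors the classification step (v) using scikit-learn (the paper used MATLAB 2015b); note that scikit-learn parametrizes the RBF kernel by \(\gamma \) rather than the width \(\rho \), so the parameter grids and cross-validation settings shown here are only illustrative:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def train_lesion_classifier(features, labels):
    """Grid-search an RBF-kernel SVM over the kernel parameter and the
    soft-margin penalty C.

    features : (n_images, K) pooled descriptors, labels : (n_images,) 0/1.
    """
    param_grid = {
        "gamma": 2.0 ** np.arange(-12, 13, 2),   # illustrative grid
        "C": 2.0 ** np.arange(-6, 7, 2),         # illustrative grid
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                          cv=5, scoring="balanced_accuracy")
    return search.fit(features, labels)
```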

4 Experimental Results

4.1 Dataset and Performance Metrics

All of the experiments were carried out on a heterogeneous dataset of 804 images (241 melanomas), selected from the EDRA database [1]. The ground-truth diagnosis was provided by a group of experts.

The different configurations were evaluated in terms of sensitivity (SE), specificity (SP), and a cost score (S) defined as follows

$$\begin{aligned} S = \frac{c_{10}(1-SE)+c_{01}(1-SP)}{c_{10}+c_{01}} , \end{aligned}$$
(9)

where \(c_{10}\) is the cost of an incorrectly classified melanoma (false negative) and \(c_{01}\) is the cost of an incorrectly classified non-melanoma (false positive). Since we consider the misclassification of a melanoma to be the more serious error, we set \(c_{01} = 1\) and \(c_{10} = 1.5c_{01}\). The results were obtained using a nested 10-fold cross-validation strategy, in which the images were divided into 10 folds, each with approximately the same proportion of benign and malignant lesions. One fold was kept for testing, while the remaining nine were used for training and parameter selection. This procedure was repeated ten times with a different fold for testing, and the reported results are the average performance over the ten test folds.
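
For concreteness, the three metrics can be computed from the confusion-matrix counts as in the following short sketch:

```python
def evaluate(tp, fn, tn, fp, c10=1.5, c01=1.0):
    """Sensitivity, specificity, and the cost score S of (9).

    c10 penalizes missed melanomas (false negatives) and c01 penalizes
    false positives; the paper sets c01 = 1 and c10 = 1.5 * c01.
    """
    se = tp / (tp + fn)                                      # sensitivity
    sp = tn / (tn + fp)                                      # specificity
    s = (c10 * (1.0 - se) + c01 * (1.0 - sp)) / (c10 + c01)  # cost score
    return se, sp, s
```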

4.2 Results

Table 1 shows the comparison between BoF and SC for the different features considered herein. Several conclusions can be drawn from these scores. The first is that max-pooling leads to significantly better results than abs-pooling for almost all of the features. Moreover, the non-negativity constraint also improves the results (only shown for max-pooling). Interestingly, the features used in other dermoscopy works (gray-level and RGB patches) achieve worse scores than the other tested color and texture features. Finally, SC outperforms BoF in almost all of the experiments, which suggests that this approach is more effective.

Table 1. Results for melanoma diagnosis using BoF and SC. In bold we highlight the best results.

Table 2 shows the number of images that are correctly and incorrectly classified by BoF and SC, using the best configuration (HSV histogram). These values show that 50% of the images incorrectly classified by BoF are correctly classified by SC. Although the opposite also occurs (58 images), it does so to a much smaller extent. We would like to point out that the scores obtained with SC using a single feature still outperform the best results obtained for this dataset with feature fusion (\(SE = 83\%\), \(SP = 76\%\)) [17].

Table 2. Number of images correctly and incorrectly classified by each of the methods using the HSV histograms as patch features.
Fig. 1. Malignant (1st row) and benign (2nd row) lesions, correctly classified by both methods.

Figure 1 shows examples of lesions correctly classified by both methods, using their best configurations, while Fig. 2 shows examples of lesions incorrectly classified by one of the methods.

Fig. 2. Malignant (1st and 3rd columns) and benign (2nd and 4th columns) lesions, incorrectly classified by BoF (1st–2nd columns) and SC (3rd–4th columns).

5 Conclusions

In this paper, we have compared bag-of-features and sparse coding for the problem of melanoma diagnosis. A simple framework was used to compare the two methods, in which the idea was to keep the common variables fixed and adjust only the key aspects that are specific to each method. This allowed us to perform a fair comparison and to show that SC outperforms BoF, obtaining a sensitivity of 85.5% and a specificity of 73.4%, versus a sensitivity of 81.7% and a specificity of 66.5%, for the corresponding best configurations.