
1 Introduction

Deep learning has emerged as a new branch of machine learning. It has proven to be very effective in many computer vision tasks such as image classification [15], object detection [25], image segmentation [8], and others. In addition to achieving high accuracy rates, Deep Learning eliminated the requirement for human experts to design feature extractors, since the convolutional layers of Convolutional Neural Networks (CNNs) are well suited for this task.

However, despite all these advantages, Deep Learning has traditionally lacked interpretability [13]. This attribute may be crucial, especially in scenarios with high misclassification costs. To attack this black-box issue, Zhou et al. [24] proposed the Class Activation Map (CAM), which highlights the image regions most significant for a CNN's prediction. This technique modifies the network architecture, replacing the fully connected layers with convolutional layers and a Global Average Pooling (GAP). Then, the channels from the output of the last convolutional layer are weighted by the network parameters that link each element of the GAP output to the neuron of the activated class. The resulting weighted sum of channels is the final visual explanation provided by CAM. More recently, Selvaraju et al. proposed Grad-CAM [21], which can be applied to many CNN models without requiring architectural changes. For this, it calculates the gradient of the network output with respect to the last convolutional layer, which measures the influence of each cell in the feature map on the network prediction.
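As an illustration, the weighted sum of channels at the core of CAM can be sketched in a few lines of NumPy. This is a minimal sketch, assuming `feature_maps` holds the output of the last convolutional layer (height x width x channels) and `class_weights` holds the GAP-to-class weights of the predicted class; variable names are illustrative.

```python
import numpy as np

def class_activation_map(feature_maps: np.ndarray, class_weights: np.ndarray) -> np.ndarray:
    """Weighted sum of feature-map channels; larger values mark regions more relevant to the class."""
    cam = np.tensordot(feature_maps, class_weights, axes=([2], [0]))  # (H, W, C) x (C,) -> (H, W)
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)  # normalize to [0, 1] for visualization
```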

The huge success of Deep Learning methods currently overshadows classic techniques based on Handcrafted (HC) features for image classification [10]. However, some researchers in the literature suggest a careful comparison between Learned (LN) and HC features. Nanni et al. [17] ran an exhaustive comparison between the two approaches in different image domains, from butterfly species classification to cancer detection. Their experiments showed several scenarios where HC features outperformed LN features in accuracy. In early 2020, Lin et al. [12] proposed a random forest to identify Magnetic Resonance (MR) images of livers that are adequate for clinical diagnosis. They reported that HC features outperformed LN features on smaller datasets, i.e., fewer than 200 images for model training. In 2021, Saba et al. [20] investigated the problem of detecting microscopic skin cancer in non-dermoscopic color images. They reported cases where HC features were better than LN features. Finally, in 2022, Silva et al. [22] evaluated HC and LN features in the context of violence detection in video frames. Their results showed that LN features cannot always be claimed superior, since some violent scenes are detected only by HC features.

A widely used image representation technique based on local HC features is the Bag of Visual Words (BoVW) [5]. Since many local descriptors may exist along a single image, a keypoint is defined as a structure composed of a feature/descriptor vector and an image coordinate indicating the local region described by that feature/descriptor. The final BoVW image representation is a histogram of the occurrences of clustered handcrafted features/descriptors present in the given image. Finally, the BoVW histograms may feed a classifier such as a Support Vector Machine (SVM) [9]. This work proposes a visualization method that allows the interpretation of the most important regions for image classification using BoVW. Several works [12, 17, 20, 22] previously evaluated the accuracy rates obtained by HC and LN features and concluded that they focus on different aspects of the images. However, to the best of our knowledge, such divergence had not been demonstrated in the literature at the image domain level.

2 Background

2.1 Keypoints

Keypoints are structures that encapsulate the representation of local features along a given image. To represent a single image patch, a keypoint has a feature/descriptor vector that holds information about the local image semantics and a coordinate tuple that localizes it within the image. The extraction of keypoints is therefore composed of at least two main steps: (i) keypoint localization, and (ii) keypoint feature/descriptor extraction.

On the one hand, a good keypoint localizer identifies local regions that are potentially distinct along the image. Such uniqueness is crucial for representing the image elements that allow its identification. Examples of algorithms for keypoint localization include FAST [18], BRISK [11], ORB [19], SURF [3], SIFT [14], and KAZE [2]. On the other hand, a good keypoint descriptor faithfully characterizes local image regions. Examples of techniques for extracting keypoint descriptors are BRISK [11], FREAK [1], BRIEF [4], SURF [3], ORB [19], SIFT [14], and KAZE [2]. These are all handcrafted techniques, i.e., such algorithms are manually designed and data-invariant.

Keypoint Localization.

Keypoint localizers generally try to find image patches that are more representative than their neighbors, based on aspects such as corners, colors, or brightness. A classic method for locating keypoints is the Harris Corner Detector. From the dx and dy image gradients, a Harris response map is generated by encoding the magnitude of gray-level changes in both horizontal and vertical directions for each \(3 \times 3\) image window. Finally, each pixel whose Harris response exceeds a predefined threshold \(\tau \) is assigned as a corner.
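A minimal OpenCV sketch of this corner detection step is shown below; the Harris parameter k and the threshold \(\tau \) are illustrative values, not those of any specific work.

```python
import cv2
import numpy as np

gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
response = cv2.cornerHarris(gray, blockSize=3, ksize=3, k=0.04)  # Harris response for 3x3 windows
tau = 0.01 * response.max()                                      # illustrative relative threshold
corners = np.argwhere(response > tau)                            # (row, col) coordinates of corners
```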

Another widely used method is FAST (Features from Accelerated Segment Test). Considering a Bresenham circle of radius three centered at each pixel in the image, FAST compares the gray value of the central pixel with each intensity along the Bresenham circumference. If N consecutive pixels of this circumference are brighter or darker than the central point, it is classified as a corner. To speed up the method, it is possible to use a machine learning-based approach for detecting consecutive patterns in a sequence. Then, after extracting these 16-pixel circumferences and their central intensity values, a classifier such as a decision tree [16] can be trained to decide whether or not the point is a corner.
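In practice, the accelerated segment test is available as a ready-to-use detector; the sketch below uses OpenCV's implementation, with an illustrative intensity threshold.

```python
import cv2

gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)
fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(gray, None)  # list of cv2.KeyPoint objects (corner locations)
print(f"{len(keypoints)} FAST keypoints detected")
```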

Other methods like SIFT [14], SURF [3], and KAZE [2] use multiscale analysis. The SIFT algorithm, for instance, computes the Difference of Gaussians (DoG) between different image scales. Local minima and maxima along the DoG are considered keypoint candidates.

Keypoint Descriptors.

After the keypoint localization step, it is necessary to associate each keypoint with an appropriate feature/descriptor vector that correctly encodes its semantics. The generated descriptors are usually based on histograms of gradients, directions of border orientations, or pixel intensities. Among those based on pixel intensity, for example, BRIEF [4] and FREAK [1] build the feature/descriptor vectors from the relative intensity of pairs of pixels within the keypoint neighborhood.

Descriptors based on gradients have been more widely used since they are more robust to lighting variation, resizing, and orientation changes [2]. In SIFT, for instance, the vectors are constructed within 16 subareas around the keypoint. For each subarea, a histogram of the gradient flow is computed along eight directions. Then, by concatenating the histograms of all subareas, a final feature/descriptor of 128 dimensions is created.
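For reference, the sketch below extracts SIFT keypoints and their 128-dimensional descriptors with OpenCV, as later used in this work; the input file name is a placeholder.

```python
import cv2

gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
# keypoints[i].pt gives the (x, y) image coordinate; descriptors[i] is a 128-d gradient histogram
```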

Fig. 1. The Bag of Visual Words (BoVW) working diagram. From a dataset partition (referred to as the Dictionary Set), keypoints are localized within all images and their feature/descriptors are extracted. A new vector space of feature/descriptors is then created. By grouping the feature/descriptors with a clustering algorithm, a set of visual words \(\varOmega = \left\{ \omega _1, \omega _2, \cdots , \omega _n\right\} \) is created in the keypoint feature/descriptor space. Given a new image \(\textbf{x}\) from the Train Set partition, image keypoints are localized and their feature/descriptors are extracted. Finally, in a vector quantization step, a frequency histogram counts how many keypoints of \(\textbf{x}\) fall into each word of \(\varOmega \).

2.2 Bag of Visual Words (BoVW)

The main idea of BoVW is to create new representations of images as histograms. These histograms count the occurrences of specific features/descriptors, referred to as visual words. To build these histograms, the following steps are necessary: (i) the features/descriptors of a subset of the data are grouped using a clustering algorithm such as K-Means [7], and the centroids \(\varOmega = \left\{ \omega _1, \omega _2, \cdots , \omega _n \right\} \) resulting from this grouping are called visual words; (ii) given the visual dictionary \(\varOmega \), every feature/descriptor extracted from a new image is associated with the visual word closest to it; (iii) finally, the histogram that describes the image is generated by counting the number of occurrences of each word in the image. These steps are summarized in Fig. 1.
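A minimal sketch of this pipeline, assuming SIFT descriptors and scikit-learn's K-Means, is given below; the dictionary size and function names are illustrative.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()

def sift_descriptors(gray_image):
    _, desc = sift.detectAndCompute(gray_image, None)
    return desc if desc is not None else np.empty((0, 128), dtype=np.float32)

def build_dictionary(dictionary_images, n_words=256):
    """Cluster all descriptors of the Dictionary Set; the centroids are the visual words."""
    all_desc = np.vstack([sift_descriptors(img) for img in dictionary_images])
    return KMeans(n_clusters=n_words, random_state=0).fit(all_desc)

def bovw_histogram(gray_image, kmeans):
    """Vector quantization: count how many keypoints fall into each visual word."""
    desc = sift_descriptors(gray_image)
    if desc.shape[0] == 0:
        return np.zeros(kmeans.n_clusters)
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()
```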

3 Methodology

The proposed Class Activation Mapping technique for visualizing the significant image regions that support the current BoVW prediction can be divided into three steps: (i) generating a correlation matrix between words \(\omega _k, 1 \le k \le K\) (for K visual words) and labels \(c_j, 1 \le j \le J\) (for J classes), (ii) generating a visual heatmap for highlighting the words along the image domain, and (iii) post-processing the BoVW-CAM visualization. These steps are graphically represented in Fig. 2.

Fig. 2. The BoVW-CAM working diagram. The correlation between each visual word \(\omega _k, 1 \le k \le K\) (for K visual words) from the dictionary \(\varOmega \) and each class \(c_j, 1 \le j \le J\) (for J classes) in the dataset is calculated to generate a \(J \times K\) correlation matrix. Then, given a new test input composed of the image, its keypoints, its BoVW histogram, and the predicted class, each keypoint location in the image domain is highlighted according to the correlation between its closest visual word and the predicted class, generating the BoVW-CAM visual explanation.

In the first step, using the feature/descriptors \(\omega _k\) that compose the dictionary \(\varOmega \) of visual words, correlation coefficients between the visual words \(\omega _k, 1 \le k \le K\) and each problem class \(c_j, 1 \le j \le J\) are calculated using Spearman's rank correlation coefficient [23]. This produces a correlation matrix where each column represents a dictionary word \(\omega _k\) and each row represents a classification label \(c_j\).
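One plausible reading of this step is sketched below: each word's column of training-set histogram counts is correlated with a binary class-membership vector via SciPy's spearmanr. The exact variables being correlated follow our interpretation of the description and should be taken as an assumption; all names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def word_class_correlations(histograms, labels, n_classes):
    """histograms: (n_images, K) BoVW histograms of the training set;
    labels: (n_images,) integer class labels. Returns a (J, K) Spearman correlation matrix."""
    n_images, n_words = histograms.shape
    corr = np.zeros((n_classes, n_words))
    for j in range(n_classes):
        membership = (labels == j).astype(float)          # binary indicator of class c_j
        for k in range(n_words):
            rho, _ = spearmanr(histograms[:, k], membership)
            corr[j, k] = 0.0 if np.isnan(rho) else rho    # guard against constant columns
    return corr
```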

In the second step, an image heatmap is generated from (a) an input image, (b) its BoVW histogram, (c) its keypoints, and (d) the predicted label. Each keypoint location in the image domain is then highlighted according to the correlation between its closest visual word and the predicted class, producing a visualization of the most important keypoints.
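A minimal sketch of this step follows, assuming the correlation matrix from step (i) and a list of keypoint coordinates with their closest visual words; all names are illustrative.

```python
import numpy as np

def raw_keypoint_heatmap(image_shape, keypoint_coords, closest_words, corr, predicted_class):
    """keypoint_coords[i]: (x, y) of keypoint i; closest_words[i]: index of its nearest visual word;
    corr: (J, K) word/class correlation matrix."""
    heatmap = np.zeros(image_shape[:2], dtype=float)
    for (x, y), k in zip(keypoint_coords, closest_words):
        value = corr[predicted_class, k]
        heatmap[int(y), int(x)] = max(heatmap[int(y), int(x)], value)  # keep the strongest evidence
    return heatmap
```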

Finally, in the third step, operations are applied to turn the previous visualization into a heatmap (Fig. 3). First, a MaxPooling2D is used to facilitate the visual identification of image regions densely occupied by keypoints, followed by a Gaussian Blur that attenuates the gray-value variations to induce a smooth heatmap. Since MaxPooling2D reduces the input dimension, upsampling the image back to its initial size is necessary. We then have the final BoVW-CAM view relative to the target class. The whole method is detailed in Algorithm 1.

Algorithm 1.
Fig. 3. Scheme for post-processing the visualization of the most important keypoints to generate the final BoVW-CAM heatmap. The input image goes through a MaxPooling2D to facilitate the visual identification of image regions densely occupied by keypoints; a Gaussian Blur then smooths the image gray values to produce a smooth heatmap; finally, the image is upsampled back to its original size.
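A minimal sketch of this post-processing step is shown below; the pooling size, Gaussian kernel, and sigma are assumptions, not the values used in the original implementation.

```python
import cv2
import numpy as np
from skimage.measure import block_reduce

def postprocess_heatmap(heatmap, pool_size=8, blur_kernel=(15, 15), sigma=5.0):
    pooled = block_reduce(heatmap, (pool_size, pool_size), np.max)        # MaxPooling2D
    blurred = cv2.GaussianBlur(pooled.astype(np.float32), blur_kernel, sigma)
    restored = cv2.resize(blurred, (heatmap.shape[1], heatmap.shape[0]))  # upsample to original size
    return restored / (restored.max() + 1e-8)                             # final BoVW-CAM heatmap
```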

4 Experiments

We designed experiments to compare the most important image regions for classification via the Bag of Visual Words and Convolutional Neural Networks (CNNs). To the best of our knowledge, this visual comparison in the image domain is unprecedented in the state of the art.

For the experiments, we used the “Cats vs. Dogs” dataset, which is a standard benchmark for binary image classification. In total, the set is composed of 12,500 images for each class.

4.1 Experimental Parameters

We used SIFT as the keypoint extractor, SVM as the classifier fed by the BoVW, and K-Means as the clustering algorithm. Finally, we used 256 words to construct the dictionary. With respect to the CNN architecture, we used three convolutional layers followed by two fully connected layers. The ReLU activation function was employed in all layers except the last one, which was activated by a Sigmoid. The evaluated architecture can be seen in Fig. 4. The optimization technique was RMSProp, the loss function was Binary Crossentropy, the learning rate was 0.001, and the training lasted 20 epochs.
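A Keras sketch consistent with this description is given below; since Fig. 4 is not reproduced here, the filter counts, kernel sizes, pooling layers, and input resolution are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 128, 3)),                       # assumed input resolution
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                   # binary cat/dog output
])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=20)
```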

Fig. 4. CNN architecture used in this work.

The database was divided following two distinct approaches. For training the CNN, 70% of the data were used. For training the BoVW, this training partition was further divided into two folds of the same size, one for building the dictionary and another for training the classifier. In both approaches, the remaining data were split in the same proportions: 10% for validation and 20% for testing.

5 Results

5.1 Visualization

We generated visual explanations for classifications according to the learned features via Grad-CAM [21] and to the handcrafted features via the proposed BoVW-CAM, shown in Figs. 5 and 6. It is clear that the two approaches focus on different aspects of the images: the BoVW method tends to cover a larger area of the classified object, while the CNN focuses on fewer aspects of the image.

Fig. 5. Visualization for the cat class with the Grad-CAM and BoVW-CAM methods.

Fig. 6. Visualization for the dog class with the Grad-CAM and BoVW-CAM methods.

5.2 Venn Diagram

It is also possible to reinforce the hypothesis that learned and handcrafted features focus on different aspects of the evaluated images by building a Venn diagram of their predictions. In Fig. 7, we can see that 66.45% of the test set is correctly classified by both methods and that the CNN correctly classifies more images than the BoVW. However, a significant number of images (523) are misclassified by the CNN while correctly classified by the BoVW. This strengthens the view that it is not straightforward to claim that Deep Learning methods can totally replace classical methods based on handcrafted features.

Fig. 7. Number of samples correctly classified by the handcrafted and learned features.

5.3 Dice Score

To measure how large the difference in image focus between BoVW and CNN is, we binarized the visual explanations of Grad-CAM and BoVW-CAM for the entire test set and calculated the Dice Score [6] between them. As a result, an average of 0.359 with a standard deviation of 0.138 was obtained. This result confirms that there is a high divergence between the aspects observed by handcrafted and learned features.
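For reproducibility, the sketch below computes the Dice Score between two binarized explanation maps; the 0.5 binarization threshold is an assumption.

```python
import numpy as np

def dice_score(map_a, map_b, threshold=0.5):
    a = map_a >= threshold
    b = map_b >= threshold
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum() + 1e-8)  # 1.0 = identical masks, 0.0 = disjoint
```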

6 Conclusion

Based on classification accuracy rates, previous works have suggested that there is a divergence between the aspects of images that handcrafted and learned features focus on. In this work, we developed a method capable of generating visual explanations for classification algorithms based on BoVW, which allowed us to compare our results with the visual explanations generated by Grad-CAM on a CNN. We thus visually compared the most relevant image regions for classifications based on keypoint-based handcrafted features and on learned features. The quantitative evaluation via Dice score confirms that the pixels considered by each classification method highly diverge from each other. Furthermore, despite the Deep Learning method having achieved a higher accuracy rate, we showed that a significant amount of test data was correctly classified exclusively by the BoVW.