1 Introduction

Computer vision has been widely used in different contexts, such as agriculture, medicine, and industry. Part of this success is due to the massive volume of complex data generated daily by these domains. Moreover, advances in computational resources (e.g., GPUs) have made it possible to process (e.g., classify) all these data efficiently [2].

A powerful computer vision technique boosted by these recent advances is the Convolutional Neural Network (CNN). Although CNNs achieve good results, they are incapable of gathering and harnessing the relationships within the data. These relationships can considerably improve the effectiveness of the classification process. To exploit this important property, it is possible to aggregate Graph Neural Networks (GNNs) [21] with CNNs.

As is well known, a graph structure can be highly affected by factors such as the number of nodes and edges. This behavior leads to a significant drawback when applying GNNs to image classification: the higher the number of edges, the higher the memory footprint and the cost to generate a suitable learning model, since the cost of training a GNN is proportional to the number of edges. Moreover, we cannot simply generate a random or complete graph for image classification, since doing so neglects semantic connections and harms the accuracy of the model. A typical approach to this problem is to build an appropriate graph by brute force (e.g., grid search), whose drawback is a high computational cost. Some works have considered an a priori graph (e.g., a knowledge graph) for specific domains to solve this issue [4, 5, 14] but, unfortunately, in dynamic and real scenarios it is almost impossible to obtain such an a priori structure.

Thus, to mitigate these drawbacks, in this paper we propose automatically tuned GNNs, defining their well-suited connections according to a given image context and thereby improving their efficiency and efficacy. To do so, we base our approach on the similarity between the graph nodes (i.e., images). Despite its simplicity, our approach achieved notable results: it significantly reduced the memory footprint when using GNNs aggregated with CNNs and obtained good accuracy compared with random and complete graphs. It also achieved a better representation of the semantics between an image and its objects to define the global classification context. It is worth mentioning that previous works [6, 10, 13] applied similar approaches, but disregarded GNNs aggregated with CNNs to define the global image context.

In summary, our main contributions are: i) an approach capable of tuning and reaching a well-suited GNN structure to improve the effectiveness of image classification; ii) a reduction in the computational cost of GNNs aggregated with CNNs; iii) a tuning policy that provides greater structural semantics to the learning model; iv) since our tuning process is based on similarity, it can be straightforwardly extended with several literature practices that rely on the same concept (e.g., different distance functions, indexes, and pruning policies).

2 Background

In [18] the authors presented the seminal work on GNNs, discussing the limitations of traditional neural networks in handling relational information among their features. Their work has guided many other GNN approaches [7, 8, 16]. Indeed, GNNs were motivated by CNNs due to their ability to extract spatial features at many scales and build expressive representations [22]. In [3] the authors introduced the use of convolutional filters in GNNs, generalizing the concept of CNNs from Euclidean spaces to non-Euclidean ones. Equation 1 formally defines the convolution operation on graphs:

$$\begin{aligned} g_\theta \star s \approx \sum ^{K}_{k=0} {\theta }'_{k}T_k(L_{sym}) s \end{aligned}$$
(1)

where \(s \in R^n\) is a graph signal, \(g_\theta \) is a spectral filter, the symbol \(\star \) is the convolution operator, \(T_k\) refers to the Chebyshev polynomials, \({\theta }' \in R^{K}\) is a vector of Chebyshev coefficients, and \(L_{sym}\) represents the normalized graph Laplacian \(L_{sym}:= D^{-\frac{1}{2}} L D^{-\frac{1}{2}}\), where the graph Laplacian is \(L := D - A\), \(A=[a_{ij}]\) is the non-negative adjacency matrix, and \(D = diag(d_1, d_2, \ldots , d_n)\) is the degree matrix of A, with \(d_i = \sum _j a_{ij}\) the degree of vertex i [13].
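For concreteness, the following minimal NumPy sketch computes \(L_{sym}\) and applies the Chebyshev filter of Eq. 1; the adjacency matrix, signal, and coefficients are illustrative placeholders rather than values from our experiments (Chebyshev filters are often defined over a rescaled Laplacian, but we follow Eq. 1 as written here).

```python
import numpy as np

def normalized_laplacian(A):
    """L_sym = D^{-1/2} (D - A) D^{-1/2} for a non-negative adjacency A."""
    d = A.sum(axis=1)                                 # degrees d_i = sum_j a_ij
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard against isolated nodes
    L = np.diag(d) - A                                # combinatorial Laplacian L = D - A
    return (d_inv_sqrt[:, None] * L) * d_inv_sqrt[None, :]

def chebyshev_filter(A, s, theta):
    """Approximate g_theta * s with Chebyshev polynomials T_k of L_sym (Eq. 1)."""
    L_sym = normalized_laplacian(A)
    T_prev, T_curr = np.eye(A.shape[0]), L_sym        # T_0 = I, T_1 = L_sym
    out = theta[0] * (T_prev @ s)
    if len(theta) > 1:
        out = out + theta[1] * (T_curr @ s)
    for k in range(2, len(theta)):                    # T_k = 2 L_sym T_{k-1} - T_{k-2}
        T_prev, T_curr = T_curr, 2 * (L_sym @ T_curr) - T_prev
        out = out + theta[k] * (T_curr @ s)
    return out
```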

In [12] the authors proposed the graph convolutional network (GCN) defined by Eq. 2. They simplified the model by limiting the convolution to first-order neighborhoods (\(k=1\)) and approximating the largest eigenvalue (\(\lambda _{max}\)) of \(L_{sym}\) by 2.

$$\begin{aligned} g_\theta \star s = \theta (I + D^{-\frac{1}{2}} AD^{-\frac{1}{2}})s, \end{aligned}$$
(2)

where \(\theta \) is the remaining Chebyshev coefficient. The next step applies a renormalization trick to the convolution matrix (see Eq. 3):

$$\begin{aligned} I + D^{-\frac{1}{2}} AD^{-\frac{1}{2}} \rightarrow \widetilde{D}^{-\frac{1}{2}} \widetilde{A}\widetilde{D}^{-\frac{1}{2}}, \end{aligned}$$
(3)

where \(\widetilde{A} = A + I\) and \(\widetilde{D}_{ii} = \sum _{j}\widetilde{A}_{ij}\).
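A short NumPy sketch of this renormalization (dense matrices for readability; \(A\) is any non-negative adjacency matrix):

```python
import numpy as np

def renormalized_adjacency(A):
    """D~^{-1/2} (A + I) D~^{-1/2}, with D~_ii = sum_j (A + I)_ij (Eq. 3)."""
    A_tilde = A + np.eye(A.shape[0])    # add self-loops: A~ = A + I
    d_tilde = A_tilde.sum(axis=1)       # always positive thanks to the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(d_tilde)
    return (d_inv_sqrt[:, None] * A_tilde) * d_inv_sqrt[None, :]
```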

Finally, the GCN itself can be described by Eq. 4.

$$\begin{aligned} H^{(l+1)} = \sigma (\widetilde{D}^{-\frac{1}{2}}\widetilde{A}\widetilde{D}^{-\frac{1}{2}}H^{(l)}\Theta ^{(l)}) \end{aligned}$$
(4)

where \(H^{(l)}\) is the matrix of activations of the l-th layer, \(H^{(0)} = X\), \(\Theta ^{(l)} \in R^{a \times f}\) is the trainable weight matrix of layer l, and \(\sigma \) is the activation function (e.g., \(ReLU(\cdot ) = max(0,\cdot )\)) [13].
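Putting Eqs. 3 and 4 together, the following minimal sketch runs a two-layer GCN forward pass; all sizes, the random graph, and the weights are hypothetical placeholders for illustration only.

```python
import numpy as np

def gcn_layer(A_hat, H, Theta):
    """One propagation step of Eq. 4: H^{(l+1)} = ReLU(A_hat H^{(l)} Theta^{(l)})."""
    return np.maximum(0.0, A_hat @ H @ Theta)

rng = np.random.default_rng(0)
n, a, f, c = 8, 32, 16, 4                          # nodes, input/hidden/output dims
M = rng.integers(0, 2, size=(n, n))
A = ((M + M.T) > 0).astype(float)                  # random symmetric adjacency
np.fill_diagonal(A, 0.0)
A_tilde = A + np.eye(n)                            # renormalization trick (Eq. 3)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))          # D~^{-1/2} A~ D~^{-1/2}
X = rng.normal(size=(n, a))                        # H^{(0)} = X (node features)
H1 = gcn_layer(A_hat, X, rng.normal(size=(a, f)))  # hidden layer
logits = A_hat @ H1 @ rng.normal(size=(f, c))      # output layer (softmax omitted)
```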

Although very promising, the work of [12] presents some issues regarding memory footprint, and it is applied to text classification (i.e., the graph structure is intrinsically defined). Moreover, it assumes that all edges are equally relevant to the learning model, neglecting that, in real scenarios, each connection can contribute differently. Besides, as described in [13], GCNs lose their representation power when several convolutional layers are stacked, and the computational cost becomes prohibitive.

We believe that our approach can cope with these issues, since we assign different relevance levels to edges according to the image context. Through this relevance, we can remove useless edges (i.e., edges that do not contribute to gathering and harnessing the semantic relationship between an image and its objects). In [6] the authors proposed a technique based on sharing network information to define the importance of a given edge in a tree structure over traditional data. However, to the best of our knowledge, our work is the first to consider a relevance mechanism in GNNs aggregated with CNNs to define the global context of images (complex data).

3 Proposed Approach

To define the well-suited connections of the GNN according to a given image context, we present the proposed pipeline illustrated in Fig. 1. In the first step, we obtain the objects (bounding boxes) from the images of the dataset. After that, in Step 2, the features of the bounding boxes are extracted through pre-trained CNNs (e.g., ImageNet transfer learning). In Step 3, the bounding boxes belonging to the same image become nodes of a complete subgraph.
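As one possible realization of Step 2 (an assumption, since the pipeline does not prescribe a specific framework), the sketch below extracts one descriptor per bounding-box crop with an ImageNet pre-trained ResNet50 via PyTorch/torchvision:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained backbone with the classification head removed:
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()   # keep the 2048-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),           # standard ImageNet preprocessing
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(crops):
    """crops: list of PIL images (bounding-box regions) -> (n, 2048) tensor."""
    batch = torch.stack([preprocess(c) for c in crops])
    return backbone(batch)
```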

Fig. 1. Pipeline of the proposed approach.

Next, a subgraph G(V, E) is built for each image, where V represents the bounding boxes of the image and E their relations. Each bounding box of a given image is connected with its respective siblings (i.e., the remaining bounding boxes from the same image). Then, in Step 4, an affinity matrix is generated to describe the entire graph, which comprises one subgraph per image in the dataset.
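A minimal sketch of Steps 3 and 4, stacking one complete subgraph per image into a single block-diagonal affinity matrix (dense NumPy for clarity):

```python
import numpy as np

def build_affinity(boxes_per_image):
    """boxes_per_image: number of bounding boxes of each image in the dataset."""
    n = sum(boxes_per_image)
    A = np.zeros((n, n))
    offset = 0
    for m in boxes_per_image:
        # complete subgraph among the boxes of one image, without self-loops
        A[offset:offset + m, offset:offset + m] = np.ones((m, m)) - np.eye(m)
        offset += m
    return A

A = build_affinity([3, 2, 4])   # e.g., three images with 3, 2, and 4 boxes
```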

In Step 5, we automatically fine-tune the generated graph, defining its well-suited connections according to a given image context. To do so, we compute the relevance of each vertex as the mean distance to its adjacent vertices; that is, we assign to each edge a relevance based on the dissimilarity between its nodes. The vertex's relevance is used as a threshold (\(\tau \)) to suppress useless incident edges. Equation 5 formally defines the threshold. Figure 2 illustrates an example of the complete graph (left) and its respective fine-tuned version (right), obtained after applying our approach to suppress edges according to the automatic threshold.

$$\begin{aligned} \tau = \frac{{\sum _{k=0}^{l-1}r(v'_{k},v_{k}'')}}{l} \end{aligned}$$
(5)

where l represents the number of edges incident to the vertex, k is an iterator, and \(r(v'_{k}, v_{k}'')\) is a function that returns the relevance of the edge connecting vertices \(v'_{k}\) and \(v''_{k}\).

Afterwards, in Step 6, our approach creates the affinity matrix of the fine-tuned graph. Finally, in Step 7, this matrix is the input used to train a GNN, generating the learning model.

In this paper, we instantiate our approach by computing edge relevance through the well-known Euclidean distance. However, the approach can be straightforwardly extended to different distance functions, as well as to other tuning mechanisms. Algorithms 1 and 2 detail our proposed approach and the tuning mechanism considered in this paper; a sketch of the tuning step is shown below.
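The sketch below reflects our reading of the tuning step: edge relevances are Euclidean distances between node feature vectors, and each vertex suppresses incident edges whose relevance exceeds its threshold \(\tau \) (Eq. 5). Variable names are illustrative.

```python
import numpy as np

def prune_graph(A, X):
    """A: 0/1 affinity matrix; X: node feature matrix (n x a)."""
    # pairwise Euclidean distances used as edge relevances
    diff = X[:, None, :] - X[None, :, :]
    R = np.sqrt((diff ** 2).sum(axis=-1))
    A_pruned = A.copy()
    for v in range(A.shape[0]):
        incident = np.flatnonzero(A[v])          # neighbors of v in the input graph
        if incident.size == 0:
            continue
        tau = R[v, incident].mean()              # Eq. 5: mean incident relevance
        drop = incident[R[v, incident] > tau]    # overly dissimilar neighbors
        A_pruned[v, drop] = 0.0
        A_pruned[drop, v] = 0.0                  # keep the graph undirected
    return A_pruned
```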

Fig. 2. Illustration of the tuning process in a fully connected graph.

An important property of our approach is that we do not need to know the bounding-box labels to predict the global context of the image. As mentioned in Sect. 2, an a priori knowledge graph is costly and tiresome to obtain. A complete graph, on the other hand, is easier to build, but it demands a high computational cost and can link incoherent objects, confusing the learning model.

Algorithm 1. Proposed approach.

4 Experiments

This section discusses the experimental scenarios, describes the image dataset used, and presents a discussion of the obtained results, comparing our proposed approach against complete and random graphs.

Algorithm 2. Tuning mechanism.

4.1 Scenarios

To obtain the best hyperparameters for the GCN, we executed a grid-search strategy [1]. To do so, we defined the hidden layer dimensions (16, 64, 256), the number of epochs (2000), the learning rates (0.001, 0.005, 0.01, 0.05), and the dropout rates (0.3, 0.5, 0.8, 0.9). These hyperparameter possibilities resulted in 384 experiments.
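The grid can be enumerated with a simple exhaustive loop, as sketched below (the training call is a placeholder, not our actual code). The grid itself yields 3 × 4 × 4 = 48 combinations per configuration; the reported total of 384 presumably accounts for the different CNN backbones and graph variants.

```python
from itertools import product

hidden_dims = [16, 64, 256]
learning_rates = [0.001, 0.005, 0.01, 0.05]
dropouts = [0.3, 0.5, 0.8, 0.9]

for h, lr, p in product(hidden_dims, learning_rates, dropouts):
    config = dict(hidden=h, epochs=2000, lr=lr, dropout=p)
    # train_and_evaluate(config)  # placeholder for the actual GCN run
```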

We combined the GNN with different CNN architectures: EfficientNetB7 [20], InceptionV3 [19], ResNet50 [9], and VGG19 [17]. Finally, to corroborate the effectiveness of our approach, we compared it against fully connected and random graphs. As the optimizer, we used Adam [11].

To perform the experiments, we used a computer with a 6th-generation Intel Core i7 (8 cores, 16 threads, 3.40 GHz), 32 GB of RAM, and an Nvidia GeForce RTX 2080 Ti GPU with 4352 CUDA cores. The experiments were executed in GPU mode.

4.2 Dataset Description

For the experiments, the MIT67 [15] dataset was chosen because it provides the requirements for this work: the images, the bounding boxes of each image, and the global classes (image classes).

MIT67 is a dataset for indoor scene recognition comprising 67 different classes. However, because some classes have few examples, annotation errors, missing data, or images without bounding boxes, data cleaning was required, which resulted in the exclusion of the following classes: auditorium, bowling, elevator, jewelry shop, locker room, hospital room, restaurant kitchen, subway, laboratory wet, movie theater, museum, nursery, operating room, and waiting room. This resulted in 53 classes, 2607 images, and 50,868 bounding boxes, which were divided into training (80% of the data) and test (20%) sets in a random and stratified way.
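A minimal sketch of the random, stratified 80/20 split using scikit-learn (an assumed tool choice; the labels below are toy values):

```python
from sklearn.model_selection import train_test_split

image_labels = ["kitchen", "bedroom", "cloister", "kitchen"] * 50   # toy labels
indices = list(range(len(image_labels)))
train_idx, test_idx = train_test_split(
    indices,
    test_size=0.20,            # 20% held out for testing
    stratify=image_labels,     # preserve per-class proportions
    random_state=42,           # illustrative seed
)
```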

Table 1 shows the distribution of images/objects by class and bounding boxes by image. It is possible to note that the dataset is unbalanced, with a high number of samples for some classes, such as “kitchen” (308 images, 7511 bounding boxes) and “bedroom” (350, 5112), and low numbers for others, such as “cloister” (16, 159) and “winecellar” (16, 222).

Table 1. MIT67 dataset distribution.

4.3 Results

To perform the experiments, we trained GCNs aggregated with the four CNN architectures cited in Sect. 4.1. Tables 2 and 3 show the obtained results for the fully connected graph and for our approach, respectively. To analyze the results, we calculated efficacy metrics such as accuracy, precision, recall, and F1. We also report the total number of edges generated and the dimensionality of the feature vectors used to represent the nodes.

Table 2. Results for top 1 performance for fully connected graph on MIT67 dataset.

Analyzing Tables 2 and 3, we can see that our approach statistically ties with the vanilla (fully connected) graph, while substantially decreasing the number of edges of the graph.

For instance, with ResNet50, the fully connected graph and our approach presented \(13.95\times 10^5\) and \(7.13\times 10^{5}\) edges, respectively. Thus, our approach reduced the memory footprint by up to 96%. We observed the same behavior when analyzing the other CNNs aggregated with our GNN. According to the results, our approach with EfficientNetB7, InceptionV3, and VGG19 achieved reductions of up to 73%, 83%, and 51%, respectively.

Table 3. Results for top 1 performance considering our approach on MIT67 dataset.

To better visualize all the considered metrics, Figs. 3(a) and (b) show radar plots of the obtained results for the fully connected graph and our approach, respectively. It is clear that our approach accomplished a considerable edge reduction while maintaining efficacy.

Fig. 3. Radar plots using different CNN architectures: (a) fully connected graph; (b) pruned graph.

We also performed experiments considering random connections. To create the random edges and obtain a fair comparison, we used the same number of edges obtained by our approach. This is important because using a higher or lower number of random edges than ours could lead to a false degeneration of the graph or to a false improvement (at a higher computational cost). Thus, our approach can also answer what the best number of edges is to reach a suitable trade-off between efficacy and efficiency.
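The random baseline can be generated with the same edge budget as the pruned graph, as in the sketch below (our reading of the protocol; names are illustrative):

```python
import numpy as np

def random_graph(n_nodes, n_edges, seed=0):
    """Sample exactly n_edges undirected edges uniformly at random."""
    rng = np.random.default_rng(seed)
    pairs = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)]
    chosen = rng.choice(len(pairs), size=n_edges, replace=False)
    A = np.zeros((n_nodes, n_nodes))
    for idx in chosen:
        i, j = pairs[idx]
        A[i, j] = A[j, i] = 1.0
    return A

A_rand = random_graph(10, 15)   # same edge count as the pruned graph would keep
```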

The experiments with random edges achieved an accuracy of at most 60%, obtained with InceptionV3. The same behavior, with our approach presenting the best results, was observed for the other CNNs. This testifies that our approach effectively defines the edges' relevance, capturing the semantic relationship between the objects of an image to define its global context.

Thus, considering the obtained results, we can argue that our proposed approach was capable of automatically tuning the GNN graph structure aggregated with a given CNN, defining well-suited connections according to a given context and improving the entire process.

5 Conclusions

In this paper, we proposed an approach capable of automatically fine-tuning a given GNN by defining its well-suited connections according to a given image context, improving the effectiveness of the entire process. Our approach was based on the similarity between the graph nodes (i.e., images).

Despite its simplicity, it reached considerable results. It not only reduced the memory footprint when using GNNs aggregated with CNNs, but also maintained good accuracy when compared with random and complete graphs. This testifies that our approach achieved a better representation of the semantics between an image and its objects to define the global classification context. Our results showed that the fine-tuned GNN reduced the memory footprint by up to 96% while maintaining accuracy.

For future work, we intend to explore other image datasets. We also aim to extend our approach with different distance functions and detection mechanisms for edge relevance.