1 Introduction

Multi-label image classification (MLIC) deals with assigning multiple labels to each image and has been applied in many fields, including multi-object recognition [17], medical diagnosis [11] and person re-identification [26]. Recent progress has mainly been made by exploiting label correlations and learning semantic representations with deep learning models.

Modeling label correlations has long been studied in multi-label classification and has been demonstrated to be very effective, because correlated labels are highly likely to co-occur [4]. For image recognition tasks, convolutional neural networks (CNNs) [10, 12, 24] and unsupervised feature extraction methods [14, 15, 39] have been widely applied to extract image features. Recently, many approaches to MLIC have been proposed that combine CNNs with label correlations, e.g., [3, 4, 23]. In these approaches, an existing deep learning model is usually employed as a tool to transform an image into a high-level abstract representation. However, the objects of interest may occupy only certain regions of an image. Some recent studies [2, 5, 28, 35, 41] noted that semantic label embeddings can make the model generate more likely label combinations at the prediction stage. Therefore, these works utilize the correlations between labels to generate semantics that guide the learning of semantic representations of images.

This paper advances this research direction by improving both the modeling of label correlations and the learning of semantic representations. On the one hand, the authors of [38] noted that higher-order label correlations have stronger modeling ability for multi-label classification. Therefore, unlike existing approaches that only learn the local semantics of each label, we propose to explore global semantics shared by multiple labels. On the other hand, existing approaches mainly learn semantic representations at the last convolutional layer of a CNN. But it has been noted that the image representations at different layers of a CNN capture different levels or scales of features and have different discriminative abilities [1, 10, 16, 32]. In addition, applying spatial attention can flexibly and adaptively aggregate semantic information and focus on the regions of interest [40]. Therefore, better performance might be achieved by simultaneously exploiting features at multiple layers of a CNN and position-wisely combining the learned image representations with label embeddings.

To realize the proposals mentioned above, we design a novel Multi-layered Semantic Representation Network (MSRN). MSRN generates both label-specific and group embeddings, which capture local and global semantics respectively, and then combines them with multiple layers of a CNN through an attention mechanism to learn label- and group-shared semantic representations of images. Specifically, first, we introduce an LGE (Label-Group Embedding) module that captures both the local semantics of each label and the semantics of groups of labels in embeddings based on the label co-occurrence graph. Second, we propose an SGA (Semantic Guided Attention) module to position-wisely guide the CNN to focus on the regions of interest. Third, we design a framework that combines the LGE module with multiple layers of a CNN through the attention mechanism built into the SGA module. We conduct experiments on five benchmark multi-label image datasets: VOC2007, VOC2012, NUS-WIDE, MS-COCO, and Apparel. The experimental results show that our method outperforms state-of-the-art approaches.

The contributions of this paper are summarized as follows:

  • A Multi-layered Semantic Representation Network (MSRN) is designed for multi-label image classification.

  • Second-order and high-order label correlations are considered simultaneously to improve the performance of multi-label image classification.

  • Semantic representations are learned at multiple layers through a position-wise attention mechanism that models label correlations.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 presents the proposed method. Section 4 presents empirical evaluation. Section 5 concludes this paper and introduces future work.

2 Related work

In MLIC, images are annotated with multiple labels simultaneously, and these labels usually have correlations. It has been demonstrated that exploiting label correlations can significantly improve performance [38]. Recent progress has been made by employing deep learning models, especially convolutional neural networks. Wang et al. [27] extract label semantics and feed them into a Recurrent Neural Network (RNN). In addition, Lee et al. [19] apply knowledge graphs to exploit label dependencies based on the label co-occurrence graph. ML-GCN [4] learns semantic label embeddings through a Graph Convolutional Network (GCN) and applies them as inter-dependent object classifiers at the prediction stage. In [29], a label graph superimposing framework is proposed to exploit label correlations; the label graph is constructed by superimposing the statistical label graph onto a knowledge-prior-oriented graph, which, however, is usually unavailable in real applications.

Some studies further locate regions of interest, because each class label might be determined by specific regions of an image. Examples include [33, 42], which apply bounding boxes to focus on region proposals. To learn regions with arbitrary boundaries, more studies propose attention-based methods, where attention is a spatial weight map representing the relative importance among pixels [16]. SRN [41] is an end-to-end CNN model that trains learnable convolutions on the attention maps of labels. In [3], Chen et al. propose an order-free RNN-based model for multi-label image classification, which uniquely integrates the learning of visual attention and Long Short-Term Memory (LSTM) layers to jointly learn the labels of interest and their co-occurrences.

Recently, some methods [4, 5, 35] apply Graph Neural Network (GNN) techniques to generate semantic label embeddings, which can be utilized as visual attention for multi-label image classification. You et al. [35] compute the cosine similarity between label embeddings to exploit label dependencies. Chen et al. [5] apply a GNN with a graph propagation mechanism to exploit the interaction between the DNN and label dependencies. Despite achieving high performance on multi-label image classification, these methods do not explicitly consider high-order label dependencies, which may result in semantics shared by a group of labels [38]. Moreover, to the best of our knowledge, no existing approach utilizes image representations extracted from multiple layers of a CNN.

3 Method

3.1 Architecture

The architecture of the proposed MSRN is shown in Fig. 1. We design an LGE module to generate label and group embeddings from an input graph \(\mathcal {G} = \{V, A\}\), where \(V = \{v_{i}\}_{i=1}^n\) is the feature matrix of labels (\(v_{i}\) is a feature vector and n is the number of labels) and \({A} = \{a_{ij}\}_{i,j=1}^n\) is the adjacency matrix encoding label co-occurrence. The outputs of the LGE module are label embeddings \(E_l= \{e_l^i\}_{i=1}^n \in \mathbb {R}^{n\times d}\) and group embeddings \(E_g = \{e_g^i\}_{i=1}^m\in \mathbb {R}^{m\times d}\), where m is the number of groups and d is the dimension of the embeddings.

Fig. 1

Overall architecture of our MSRN model. Given an image, a CNN backbone outputs image features from different layers to different branches. At the same time, the LGE module generates the label and group embeddings. Then, the SGA module produces label-level and group-level semantic representations of the image by combining the embeddings with the image feature map from each branch. Finally, we concatenate the generated semantic representations and perform the classification

The backbone in our framework can be any kind of CNN, such as VGG [24], ResNet [12] or DenseNet [10]; in this paper, ResNet-101 is chosen for the experiments. Given an input image, \(\overline{F} = \{\overline{f}_{b}\}_{b=1}^B\) denotes the output image features of the different branches, where B is the total number of branches and \({\overline{f}_{b}} \in \mathbb {R}^{W_b \times H_b \times C_b}\) is the image feature of the b-th branch with spatial resolution \(W_b \times H_b\) and \(C_b\) channels. The branches, marked as blue lines in Fig. 1, receive the image feature maps \(\overline{f}_{b}\) from the corresponding layers of the CNN. We use three branches in our work, receiving the output image features from the last layer of each of the last three blocks of ResNet-101. Since the channel dimensions \(C_b\) of image features from different layers of the CNN are distinct, we use a convolutional layer with a 1\(\times\)1 kernel to project the image features to \({f_{b}} ={conv}^{1\times 1}(\overline{f}_b )\in \mathbb {R}^{W_b \times H_b \times d}\), which has the same dimension d as the label embeddings \(E_l\) and group embeddings \(E_g\).
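As an illustration, this multi-branch extraction with 1\(\times\)1 projections can be sketched in PyTorch as follows. This is a minimal sketch assuming a torchvision ResNet-101; the module names, the default d, and the omission of the buffer layers and pooling described in Sect. 4.1 are our simplifications, not the authors' implementation:

```python
import torch.nn as nn
from torchvision.models import resnet101

class MultiBranchBackbone(nn.Module):
    """Extracts feature maps from the last three ResNet-101 blocks and
    projects each to a common embedding dimension d with a 1x1 conv."""
    def __init__(self, d=512):
        super().__init__()
        net = resnet101(pretrained=True)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4
        # ResNet-101 channel counts C_b for layer2/3/4 are 512/1024/2048.
        self.proj = nn.ModuleList(nn.Conv2d(c, d, kernel_size=1)
                                  for c in (512, 1024, 2048))

    def forward(self, x):
        x = self.layer1(self.stem(x))
        f2 = self.layer2(x)    # branch 1: coarse features
        f3 = self.layer3(f2)   # branch 2
        f4 = self.layer4(f3)   # branch 3: last block
        return [p(f) for p, f in zip(self.proj, (f2, f3, f4))]
```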

Then we propose an SGA module to position-wisely combine the image feature maps \(F = \{{f}_{b}\}_{b=1}^B\) with the label embeddings \(E_l\) and group embeddings \(E_g\). The outputs of the SGA module are label semantic representations \(O =\{{o}_{b}\}_{b=1}^B\) and group-shared semantic representations \(Q= \{{q}_{b}\}_{b=1}^B\), where \(o_b\in \mathbb {R}^{n\times d}\) and \(q_b\in \mathbb {R}^{m\times d}\). Some existing approaches utilize a label-specific representation for each label. But since, in real applications, the labeling of a dataset usually contains noisy or missing labels, label-specific semantic representations alone might not be sufficient to predict the correct labels. Therefore, in the final stage of our framework, we concatenate the generated label and group-shared semantic representations into \(M=[O||Q]\) and apply fully connected layers to perform the prediction, where the cross-entropy loss function is adopted as follows:

$$\begin{aligned} \mathcal {L}_{1}= -\sum _{i=1}^{n}\left[ y^{i}\log (\sigma (\hat{y}^{i})) + (1-y^{i})\log (1-\sigma (\hat{y}^{i}))\right] , \end{aligned}$$
(1)

where \(y^{i}\) equals 1 if the image is annotated with the i-th label and 0 otherwise, \(\hat{y}^{i}\) is the corresponding output of the fully connected layer, and \(\sigma (\cdot )\) is the sigmoid function.
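For reference, Eq. (1) is the standard multi-label binary cross entropy; a minimal PyTorch sketch (the tensor shapes are our assumption):

```python
import torch.nn.functional as F

def multilabel_bce(y_hat, y):
    # y_hat: raw outputs of the fully connected layer, shape (batch, n)
    # y: binary ground-truth label matrix, shape (batch, n)
    # Applies the sigmoid internally and averages the per-label terms.
    return F.binary_cross_entropy_with_logits(y_hat, y.float())
```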

3.2 Label-group embedding module

Since label correlations are important information in multi-label image classification, as mentioned in Sect. 1, we build a Label-Group Embedding (LGE) module to generate semantic label embeddings \(E_l\) and group embeddings \(E_g\).

3.2.1 Semantic label embeddings

Graph Attention Networks (GAT) [25] are self-attention-based models frequently used for learning embeddings of graph-structured data. With the GAT algorithm, we can obtain the semantic label embeddings \(E_l\) from the label graph \(\mathcal {G}\). In our model, GAT first produces the attention coefficient \(\alpha _{{ij}}\) between the i-th and j-th labels as follows:

$$\begin{aligned} \alpha _{{ij}} = \frac{\exp (\text {LeakyReLU}(P[{U}v_{i}||{U}v_{j}]))}{\sum _{k \in \mathcal {N}_i} \exp (\text {LeakyReLU}(P[{U}v_{i}||{U}v_{k}]))}, \end{aligned}$$
(2)

where \(P \in \mathbb {R}^{1 \times 2w}\) and \(U\in \mathbb {R}^{w\times v}\) are two learnable weight matrices, v and w are the input and output feature dimensions of the GAT layer respectively, \(\mathcal N_i\) is the set of neighbors of label i in the graph, and || represents the concatenation operation. The negative slope of the LeakyReLU is set to 0.2 in our work. Then, we obtain the label embeddings \(E_l^1 = \{e_l^i\}_{i=1}^n\) from the first GAT layer by linearly combining the attention coefficients \(\alpha\) with the transformed label features:

$$\begin{aligned} e_l^i =\sigma \left( \sum _{j \in \mathcal {N}_i}\alpha _{{ij}}{U}v_{j} + {U}v_{i}\right) , \end{aligned}$$
(3)

where \(\sigma (\cdot )\) is a non-linear activation function (ELU in our method). For simplicity, \(\text {GAT}_t(\cdot )\) denotes the t-th GAT layer consisting of Eqs. (2) and (3), and the semantic label embeddings \(E_l^t\) are generated by the following equation:

$$\begin{aligned} E_l^t = \text {GAT}_{t}(E_l^{t-1}, A) , \end{aligned}$$
(4)

where \(E_l^0=V\) is the original feature matrix of labels.
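For concreteness, a single-head GAT layer implementing Eqs. (2) and (3) might look as follows. This is a sketch with our own names (the paper's P and U become attn and lin), and A is assumed to contain self-loops so that every row of the softmax is well defined:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)   # U
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)   # P
        self.leaky = nn.LeakyReLU(0.2)

    def forward(self, V, A):
        # V: (n, in_dim) label features; A: (n, n) adjacency with self-loops.
        h = self.lin(V)                                     # U v_i, (n, out_dim)
        n = h.size(0)
        # Pairwise concatenations [U v_i || U v_j] for all i, j.
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = self.leaky(self.attn(pairs)).squeeze(-1)   # (n, n)
        scores = scores.masked_fill(A == 0, float('-inf'))  # restrict to N_i
        alpha = F.softmax(scores, dim=1)                    # Eq. (2)
        return F.elu(alpha @ h + h)                         # Eq. (3)
```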

3.2.2 Semantic group embeddings

Differentiable graph pooling (Diffpool) [34] is a graph clustering algorithm that softly maps graph nodes to a set of clusters. Once we have obtained the semantic label embeddings \(E_l\), we apply Diffpool to generate the semantic group embeddings \(E_g\) as

$$\begin{aligned} E_g =\text {Diffpool}(E_l,A). \end{aligned}$$
(5)

Moreover, in order to learn more compact group embeddings, we minimize the distance between the group embeddings \(E_g\) and the label embeddings \(E_l\) as follows:

$$\begin{aligned} \mathcal {L}_{2} = \sum _{k=1}^{m} \sum _{ E_l^{i}\in C_{k}} \Vert E_g^{k} - E_l^{i}\Vert _2^{2}, \end{aligned}$$
(6)

where \(C_{k}\) denotes the k-th cluster, which contains highly correlated labels.
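A simplified sketch of Eqs. (5) and (6) is given below. Note that the actual Diffpool computes the soft assignments with a GNN over the label graph, whereas this sketch uses a single linear layer as a stand-in; all names are ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftGrouping(nn.Module):
    """Diffpool-style soft clustering of n label embeddings into m groups."""
    def __init__(self, d, m):
        super().__init__()
        self.assign = nn.Linear(d, m)   # stand-in for the assignment GNN

    def forward(self, E_l):
        S = F.softmax(self.assign(E_l), dim=1)  # (n, m) soft assignments
        E_g = S.t() @ E_l                       # (m, d) group embeddings, Eq. (5)
        # Eq. (6): pull each label toward the centroid of its (hard) cluster.
        k = S.argmax(dim=1)                     # cluster index per label
        l2 = ((E_g[k] - E_l) ** 2).sum()
        return E_g, l2
```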

3.3 Semantic guided attention module

The aim of the SGA module is to utilize the semantic embeddings \(E_l\) and \(E_g\) to guide the learning of semantic representations of images at the different branches. As the feature at each position (w, h) of an image feature map could be correlated with the semantics of the label embeddings, we propose a position-wise attention mechanism to fully combine the image feature space and the semantic embedding space. Similar to existing studies [5, 18], we adopt the Hadamard product between each position (w, h) of the image feature map from the b-th branch and the label and group embeddings to calculate the attention weights as

$$\begin{aligned} {sl_{b}}_{w,h}^i=f^{w,h}_{b} \odot e_l^i,~~~ {sg_{b}}_{w,h}^j=f^{w,h}_{b} \odot e_g^j, \end{aligned}$$
(7)

where \(\odot\) is the Hadamard product, \({sl_{b}}_{w,h} \in \mathbb {R}^{1 \times 1 \times n \times d}\) and \({sg_{b}}_{w,h} \in \mathbb {R}^{1 \times 1 \times m \times d}\). Then we normalize the computed compatibility scores \(al_{b} \in \mathbb {R}^{W_b \times H_b \times n \times d}\) and \(ag_{b} \in \mathbb {R}^{W_b \times H_b \times m \times d}\):

$$\begin{aligned} al^{w,h}_{b} = \frac{\exp ({sl_{b}}_{w,h} )}{\sum _{x,y}\exp ({sl_{b}}_{x,y})}, ag^{w,h}_{b} = \frac{\exp ({sg_{b}}_{w,h} )}{\sum _{x,y}\exp ({sg_{b}}_{x,y})}. \end{aligned}$$
(8)

Once the normalized compatibility scores are obtained, we apply a second Hadamard product and aggregate over positions to generate the semantic representations:

$$\begin{aligned} o_{b} = \sum _{w,h}al^{w,h}_{b} \odot f^{w,h}_{b}, ~~~q_{b} = \sum _{w,h}ag^{w,h}_{b} \odot f^{w,h}_{b}. \end{aligned}$$
(9)

Finally, the total training loss is \(\mathcal {L}_{1} + \lambda \mathcal {L}_{2}\), where \(\lambda\) is a regularization parameter.
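To make the tensor shapes concrete, the following sketch implements one SGA branch for the label embeddings; the group branch is identical with \(E_g\) in place of \(E_l\), and the function and variable names are ours:

```python
import torch

def sga_branch(f_b, E):
    """Position-wise attention of Eqs. (7)-(9) for one branch.
    f_b: (W, H, d) projected feature map; E: (n, d) label embeddings
    (or (m, d) group embeddings). Returns (n, d) representations."""
    Wb, Hb, d = f_b.shape
    f = f_b.reshape(Wb * Hb, 1, d)   # flatten spatial positions
    s = f * E.unsqueeze(0)           # Eq. (7): Hadamard product, (W*H, n, d)
    a = torch.softmax(s, dim=0)      # Eq. (8): normalize over positions
    return (a * f).sum(dim=0)        # Eq. (9): aggregate into (n, d)
```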

3.4 Model prediction

In the test stage, we concatenate the local semantic representations \(O=\{o_b\}_{b=1}^B\in \mathbb {R}^{n\times (Bd)}\) and the group-shared semantic representations \(Q=\{q_b\}_{b=1}^B\in \mathbb {R}^{m\times (Bd)}\), and predict the labels by \(\hat{y}^{i}={fc}_{2}(\text {LeakyReLU}({fc}_{1}(\text {tanh}(M))))\), where \(M=[O||Q]\), and \({fc}_{1}\) and \({fc}_{2}\) are fully connected layers. It should be noted that the adjacency matrix used in the training and testing stages is the same one, computed from the label co-occurrence information of the training set.
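A minimal sketch of this prediction head follows; the flattening of M into one vector per image is our reading, and the hidden size comes from Sect. 4.1:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionHead(nn.Module):
    """Computes fc2(LeakyReLU(fc1(tanh(M)))) as described in Sect. 3.4."""
    def __init__(self, in_dim, n_labels, hidden=2048):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)    # output dim 2048 per Sect. 4.1
        self.fc2 = nn.Linear(hidden, n_labels)  # one logit per label

    def forward(self, M):
        return self.fc2(F.leaky_relu(self.fc1(torch.tanh(M))))
```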

4 Empirical evaluation

In this section, we describe the implementation details of our proposed model MSRN and the experimental results.

4.1 Implementation details and evaluation metrics

The input label features V are 300-dimensional GloVe features pretrained on the Wikipedia dataset. The backbone ResNet-101 is pretrained on ImageNet to accelerate training. We remove the last average pooling layer and the classifier of ResNet-101 and apply max pooling with kernel size \(2\times 2\) and stride 2 to obtain the image features \(\overline{F}\) from the last three building blocks of ResNet-101. The output dimension of \({fc}_{1}\) is 2048, and the output dimension of \({fc}_{2}\) equals the number of labels. In addition, to reduce the impact on the gradients of the branches corresponding to lower layers of the backbone, we add a buffer convolutional layer [1] with kernel size \(1\times 1\) and stride 1 before obtaining the image features of the last two branches. The output feature dimension of the first GAT layer and the input feature dimension of the second are both 300, and the output feature dimension of the second GAT layer is 512. The number of label groups m is set to 4. The regularization parameter \(\lambda\) is set to 0.001. The input image is resized to 448 \(\times\) 448 for both training and testing. We train our model on one Tesla V100-16GB GPU with a batch size of 8. For optimization, we apply SGD with momentum 0.9 and weight decay 10\(^{-4}\). The initial learning rate is set to 0.01 and is decayed by a factor of 0.1 every 30 epochs, for 90 epochs in total.
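This optimization setup corresponds to the following PyTorch sketch, where `model` is assumed to be an MSRN instance:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by a factor of 0.1 every 30 epochs (90 in total).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```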

The evaluation metrics we used in our experiments include mean average precision (mAP) over all categories, precision (CP, OP), recall (CR, OR), and F1 score (CF1, OF1).

Table 1 Comparison of mAP and AP (in %) of our method and state-of-the-art methods on the Pascal VOC2007 dataset, where bold numbers indicate the best and underlined numbers the second-best performance
Table 2 Comparison of mAP and AP (in %) of our method and state-of-the-art methods on the Pascal VOC2012 dataset, where bold numbers indicate the best and underlined numbers the second-best performance
Table 3 Comparison of our method with state-of-the-art methods on the MS-COCO dataset, where bold numbers indicate the best and underlined numbers the second-best performance

4.2 Experimental results

VOC2007 [8] We compare our method with ResNet-101 [12], CNN-RNN [27], AR [2], ML-GCN [4], FGCN [30], SSGRL [5] and SIGCN [6]. Following [5], we also pretrain our model on the MS-COCO dataset. The results of all methods are shown in Table 1. The mAP of MSRN(pre) is 2%, 1.9% and 1% higher than those of ML-GCN [4], FGCN [30], and SSGRL [5]/SIGCN [6], respectively. It should be noted that the input image size of SSGRL(pre) is 576\(\times\)576, which is larger than ours. Our model also achieves the best AP score on 17 categories. These results demonstrate the effectiveness of modeling multi-layered semantic representations.

VOC2012 [9] We also perform experiments on the VOC2012 dataset and compare with RMIC [13], VGG+SVM [24], FeV+LV [33], HCP+AGS [31], SSGRL [5] and SIGCN [6]. The experimental results are shown in Table 2: our method achieves the best AP score on 11 categories and outperforms SSGRL (576 input size) and SIGCN by 0.1% and 0.2% mAP, respectively. With the COCO-pretrained model, our method is also 0.2% better than SSGRL(pre) on mAP. These results show that our method achieves competitive performance compared with other works.

MS-COCO [21] The comparison results on the MS-COCO dataset are shown in Table 3. We compare our method with ResNet-101 [12], ML-GCN [4], FGCN [30], MS-CMA [35] and SIGCN [6]. Our method achieves an 83.4% mAP score, ranking first, and performs comparably with the state-of-the-art methods. Although SSGRL and MS-CMA are 0.4% better than our method on most metrics at a 448 input image size, these two methods have a more advantageous experimental setup, such as a larger image size and multi-scale training. Therefore, we also test our method with a 576 input image size. In this setting, MSRN ranks first in terms of mAP, CR, CF1, OR and OF1. For the results on top-3 labels, MSRN obtains the best performance in terms of OR and OF1.

NUS-WIDE [7] contains 269,648 images and 81 concepts. The dataset is split following [35]. We compare MSRN with CNN-RNN [27], ML-GCN [4], GATN [36], AT [37], S-CLs [22] and MS-CMA [35]. As shown in Table 4, our model achieves a 0.1% higher mAP than MS-CMA. In addition, MSRN achieves the best performance in terms of CF1, OF1 and CF1-3. Both our model and MS-CMA [35] extract image features from lower layers of CNNs, but our model outperforms it by 0.1%, 0.2%, 0.2% and 0.4% in terms of mAP, CF1, OF1 and CF1-3, respectively.

Table 4 Comparison with state-of-the-art methods on the NUS-WIDE dataset, where bold numbers indicate the best and underlined numbers the second-best performance

Apparel is a clothing dataset for multi-label image classification. We test our model on it and compare with ResNet-101 [12], SSGRL [5] and ML-GCN [4]. In our experiment, we randomly select 50% of the images for training and use the other 50% for testing. The results in Table 5 show that our model achieves a 99.65% mAP score, which is 0.18% better than ResNet-101 and 0.06% better than the current best model. Our model also achieves the best score on all the metrics we employ.

Table 5 Comparison with state-of-the-art methods on the Apparel dataset, where bold numbers indicate the best and underlined numbers the second-best performance

4.3 Ablation studies

In this section, we perform ablation studies to evaluate the effectiveness of different components of our framework.

Label and group embeddings To verify the effectiveness of the label and group embeddings, we conduct experiments with three simplified versions of our proposed method MSRN: label-E (only using label embeddings), group-E (only using group embeddings), and a version without the LGE module. The results shown in Table 6 clearly indicate the effectiveness of the label and group embeddings.

Table 6 Comparison among different versions of MSRN

Number of branches As ResNet-101 contains four blocks, we conduct experiments to validate whether the multi-branch architecture is better than a single-branch architecture and whether the model performs best with all branches. The experimental results are shown in Table 7. We find that the multi-branch architecture improves mAP by at least 0.12% over the single-branch architecture and achieves the best performance with the last three branches.

Table 7 Comparison among different numbers of branches

4.4 Parameter sensitivity

In this section, we study the sensitivity of MSRN to two hyper-parameters: the number of label groups m and the regularization parameter \(\lambda\). Due to space limitations, we only present the analyses on the VOC2007 dataset. For the number of groups m, we conduct experiments with six values, 2, 4, 6, 8, 10 and 20, with \(\lambda\) fixed at \(10^{-3}\). The experimental results in Table 8 show that the performance in terms of mAP is not very sensitive to m. For \(\lambda\), we study the values \(\{10^{-1}, 10^{-2},\ldots , 10^{-6}\}\). The results with different values of \(\lambda\), also shown in Table 8, indicate that the performance is not very sensitive to \(\lambda\) either.

Table 8 Comparison among different values of m and \(\lambda\)

5 Conclusion and future work

This paper proposes a novel Multi-layered Semantic Representation Network (MSRN) for multi-label image classification. MSRN, for the first time, considers both local and global semantics of labels by modeling label correlations, and learns semantic representations of images at multiple layers of a convolutional neural network through an attention mechanism. Extensive experiments show that MSRN outperforms many state-of-the-art methods on the VOC2007, VOC2012, MS-COCO, NUS-WIDE and Apparel datasets. In the future, we will improve our method to explicitly utilize labels that exist but are unobserved due to incomplete annotation.