1 Introduction

Image multi-label classification (MLC) is a typical computer vision problem that classifies the presence (positive) or absence (negative) of multiple categories in each image. As an image usually contains multiple objects or concepts, MLC is more practical than its single-label counterpart and hence has a wide range of applications, such as medical image interpretation [6, 7, 21].

A crucial challenge in training convolutional neural networks (CNNs) for image MLC is that the training data are often only partially labeled [17, 27]. That is, for each image sample, only the labels on some categories are known, while the rest are unknown. This is because manually collecting fully labeled data is expensive [13], especially when the numbers of categories and samples are very large.

A popular and effective solution for training CNNs with partially labeled data is to treat all unknown labels as negative labels [2, 3, 26, 34], named Negative mode [1]. This mode is based on the prior knowledge that in MLC datasets negative labels usually greatly outnumber positive labels [28]. Nevertheless, this mode introduces wrong labels into the training data, as the ground truths of some unknown labels are actually positive rather than negative. These wrong labels are usually unevenly distributed over the categories [1]: categories with more wrong labels suffer more harm, so different categories suffer from varying degrees of performance decrease.

On the other hand, another solution is to ignore the contributions of unknown labels [1, 13], named Ignore mode [1]. This mode may be less effective than Negative mode [26], as it does not utilize the prior knowledge that negative labels are in the majority. Even so, it ensures the training data contain no additional wrong labels, a vital advantage that Negative mode lacks. Therefore, several works utilize this advantage of Ignore mode to improve Negative mode when training CNNs from initial parameters [1, 26].

In this paper, we propose Category-wise Fine-Tuning (CFT), a new post-training method that can be applied to a CNN that has been trained with Negative mode to improve its binary classification performance on each category independently. CFT is therefore very different from most approaches, which train a CNN from initial parameters [1, 26]. Specifically, CFT uses Ignore mode to fine-tune, one by one, the logistic regressions (LRs) in the classification layer, where each LR outputs the binary classification result for one category. The use of Ignore mode reduces the performance decrease caused by the wrong labels of Negative mode during training, and the one-by-one fine-tuning improves the performance on each category independently, without affecting the performance on other categories.

When applying CFT to a CNN, different LRs may prefer different fine-tuning configurations (optimization methods, methods for handling untypical labels in particular MLC datasets, etc.) to achieve higher performance. Therefore, we additionally use a greedy selection in CFT that chooses the best configuration for each LR from multiple configuration candidates.

During experiments, we found that fine-tuning an LR using binary cross-entropy (BCE) loss with backpropagation sometimes unintentionally decreases performance metrics such as AUC (area under the receiver operating characteristic curve). In contrast, fine-tuning with a Genetic Algorithm (GA) [29] can directly improve the metric, avoiding the performance drops caused by minimizing BCE.

Extensive experiments were conducted on the CheXpert [21] competition dataset and on partially labeled versions of the standard MS-COCO [28] MLC dataset to evaluate the effectiveness of our methods. In particular, to the best of our knowledge, our methods achieve state-of-the-art results on the CheXpert dataset. We submitted a single CNN to the competition server for official evaluation on the test set. It achieves mAUC 91.82%, the highest single-model score in the leaderboard and the literature. The competition server was subsequently closed and the test set released; our ensemble of 5 single CNNs was therefore evaluated on a local machine and achieves mAUC 93.33% on the test set, surpassing the best result in the leaderboard and the literature (mAUC 93.05% [44]).

2 Related Work

Several approaches have been proposed to address MLC with partial labels. Binary Relevance [15] converts MLC into multiple binary classification tasks, but it usually fails to model the label dependencies and is less scalable to a large number of categories. [23, 41, 43] adopted low-rank learning, [39] used a mixed graph to encode a network of label dependencies, [3, 12] predicted unknown labels by learning label relations, and [8, 24, 38] predicted unknown labels by posterior inference. However, most of these approaches cannot be well adapted for training deep models, as they require loading all training data into memory or solving costly optimization problems.

Some approaches train deep models with partial labels by exploiting image and category dependencies. Durand et al. [13] proposed predicting unknown labels based on curriculum learning with graph neural networks to model the correlations between categories. IMCL [20] interactively learns a model with a similarity learner which discovers label and image dependencies. SST [5] and HST [4] explore the image-specific occurrence and category-specific feature similarities to complement unknown labels. SARB [32] complements unknown labels by learning and blending category-specific feature representation across different images. However, most of these approaches require particular model architectures or training schemes.

Negative mode and Ignore mode are more prevalent than the complex approaches mentioned above. Ignore mode simply ignores the contributions of unknown labels (e.g., partial-BCE loss [13] and partial asymmetric loss [1]), while Negative mode [2, 3, 26, 34] treats all unknown labels as negative labels. Several works (including this paper) aim to improve Negative mode with Ignore mode, as introduced in Sect. 1. Kundu et al. [26] proposed to soften the signal of the wrong labels of Negative mode by exploiting image and label relationships, but their method does not prevent some categories from being trained on too many wrong labels. Ben-Baruch et al. [1] proposed a Selective approach that sets the training mode of each category to either Negative or Ignore, but it requires the presence frequency of every category, which is unavailable in partially labeled datasets.

Unlike most previous approaches, which aim to train high-performance models from initial parameters, the proposed CFT is a post-training method based on Ignore mode that can be applied to models trained with Negative mode to further improve their performance. Moreover, CFT can independently improve the classification performance on each category. Hence, CFT may also be able to further improve the performance of models trained with the other approaches mentioned above.

3 Methods

This section presents the proposed CFT, the greedy selection for selecting fine-tuning configurations, and GA for fine-tuning, as summarized in Fig. 1.

Notations. Consider a C-category image MLC task with a training set \(\mathcal {D}=\bigl \{(I, \textbf{y})_i\bigr \}\). Each sample \((I, \textbf{y})\) consists of an image I and a label vector \(\textbf{y}=[y_1,...,y_C]\in \left\{ -1, 1, 0\right\} ^{C}\), where the \(c^\text {th}\) (\(c\in \{1, ..., C\}\)) element \(y_c\) is the label on category c and is assigned either \(-1\) (negative), 1 (positive), or 0 (unknown). A deep neural network (typically a CNN), referred to as Baseline, has been trained on the training set \(\mathcal {D}\) with Negative mode. The architecture of Baseline consists of: (1) a backbone \(\texttt{b}\) that transforms an input image I to a feature vector \(\textbf{z}=\texttt{b}(I)\in \mathbb {R}^Z\); and (2) a C-unit fully-connected layer \(\texttt{h}\) with Sigmoid activation that transforms a feature vector \(\textbf{z}\) to an output vector \(\hat{\textbf{y}}=\texttt{h}(\textbf{z})=[\hat{y}_1, ..., \hat{y}_C]\in [0, 1]^C\), where the \(c^\text {th}\) element \(\hat{y}_c\) represents the binary classification result on category c. To better illustrate CFT, we equivalently regard the fully-connected layer \(\texttt{h}\) as C independent logistic regressions (LRs) \(\texttt{h}_1, ..., \texttt{h}_C\), as shown in Fig. 1 left. The \(c^\text {th}\) LR \(\texttt{h}_c\) transforms a feature vector \(\textbf{z}\) to an output \(\hat{y}_c=\texttt{h}_c(\textbf{z})\).
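To make the view of the classification layer as C independent LRs concrete, here is a minimal PyTorch sketch (the class and helper names are illustrative, not taken from the paper's code): the \(c^\text {th}\) row of the fully-connected layer's weight matrix, together with the \(c^\text {th}\) bias, forms the standalone LR \(\texttt{h}_c\).

```python
import torch
import torch.nn as nn

class Baseline(nn.Module):
    """Illustrative baseline: backbone b + C-unit classification layer h."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.b = backbone                           # z = b(I), z in R^Z
        self.h = nn.Linear(feat_dim, num_classes)   # C logits; sigmoid applied in forward

    def forward(self, images):
        z = self.b(images)
        return torch.sigmoid(self.h(z))             # y_hat in [0, 1]^C

def extract_lr(model: "Baseline", c: int) -> nn.Linear:
    """Return the c-th LR h_c (0-indexed) as a standalone single-output layer.

    h_c shares no parameters with the other C-1 outputs, so fine-tuning this
    copy and writing it back only changes y_hat_c.
    """
    lr = nn.Linear(model.h.in_features, 1)
    with torch.no_grad():
        lr.weight.copy_(model.h.weight[c:c + 1])    # row c of the weight matrix
        lr.bias.copy_(model.h.bias[c:c + 1])
    return lr

def write_back_lr(model: "Baseline", c: int, lr: nn.Linear) -> None:
    """Copy the fine-tuned parameters of h_c back into the classification layer."""
    with torch.no_grad():
        model.h.weight[c:c + 1] = lr.weight
        model.h.bias[c:c + 1] = lr.bias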

3.1 Category-Wise Fine-Tuning (CFT)

The proposed CFT is a post-training method that can be applied to Baseline. CFT uses Ignore mode to one-by-one fine-tune the LRs \(\texttt{h}_1, ..., \texttt{h}_C\) to improve its performance on each category independently. Therefore, the backbone \(\texttt{b}\) is always unchanged.

Specifically, the procedure of CFT has C steps (i.e., one step per category). The goal of the \(c^\text {th}\) step (\(c\in \{1, ..., C\}\)) is to independently improve the performance on category c by fine-tuning Baseline. That is, the fine-tuning only improves the performance on category c while keeping the performance on other categories unchanged. Hence, each category can be independently improved without any concern of harming other categories.

To achieve this goal, at the \(c^\text {th}\) step, only the \(c^\text {th}\) LR \(\texttt{h}_c\) is fine-tuned instead of the whole Baseline. This is because changing all the parameters of Baseline would change the performance on all categories, which does not match the goal. In contrast, changing the parameters of \(\texttt{h}_c\) only affects the output \(\hat{y}_c\) on category c and does not affect the outputs on other categories, which matches the goal.

Fig. 1. The overview of CFT and the greedy selection.

At the \(c^\text {th}\) step, the \(c^\text {th}\) LR \(\texttt{h}_c\) is fine-tuned using binary cross-entropy (BCE) loss with backpropagation (BP), which is popular for optimizing binary classification models. Ignore mode is used to reduce the performance decrease caused by the wrong labels of Negative mode during training. In particular, \(\texttt{h}_c\) is fine-tuned on a new training set \(\mathcal {D}_c\) generated from the original training set \(\mathcal {D}\) to enable Ignore mode and reduce the computation cost, as shown in Fig. 1 right. We first select the samples from \(\mathcal {D}\) whose label on category c is known (i.e., \(y_c\in \{-1, 1\}\)) to be the samples of \(\mathcal {D}_c\). This selection ensures \(\texttt{h}_c\) is fine-tuned with Ignore mode. Then, as the backbone \(\texttt{b}\) is always unchanged, we convert the image I of each sample to a feature vector \(\textbf{z}=\texttt{b}(I)\) in advance to avoid unnecessary computation during fine-tuning. Lastly, the unnecessary labels on other categories are dropped. Formally, the new training set \(\mathcal {D}_c = \bigl \{(\textbf{z}, y_c)_{i}\bigr \}\) is generated by: \(\mathcal {D}_c = \bigl \{\texttt{T}((I, \textbf{y})) \big | (I, \textbf{y})\in \mathcal {D}, y_c\in \{-1, 1\}\bigr \}\) where \(\texttt{T}((I, \textbf{y}))=\bigl (\texttt{b}(I), y_c\bigr ) = (\textbf{z}, y_c)\).
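A minimal PyTorch sketch of this step under the notation above (function names are illustrative): build \(\mathcal {D}_c\) by keeping only samples whose label on category c is known and caching the features \(\textbf{z}=\texttt{b}(I)\), then fine-tune \(\texttt{h}_c\) with full-batch BCE. Checkpoint selection by validation performance, as used in the experiments, is simplified here to tracking the training loss.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_dc(backbone, loader, c):
    """Build D_c: keep samples whose label on category c is known, cache z = b(I).

    `loader` is assumed to yield (images, labels) with labels in {-1, 1, 0}^C.
    """
    backbone.eval()
    feats, targets = [], []
    for images, labels in loader:
        known = labels[:, c] != 0                           # y_c in {-1, 1}
        if known.any():
            feats.append(backbone(images[known]))
            targets.append((labels[known][:, c] > 0).float())  # map {-1, 1} -> {0, 1}
    return torch.cat(feats), torch.cat(targets)

def finetune_lr_bce(head, feats, targets, epochs=500, step_size=1e-4):
    """Fine-tune one LR on D_c with full-batch gradient descent on BCE (CFT-BCE)."""
    opt = torch.optim.SGD(head.parameters(), lr=step_size)
    best_state = {k: v.clone() for k, v in head.state_dict().items()}
    best_loss = float("inf")
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(head(feats).squeeze(1), targets)
        loss.backward()
        opt.step()
        if loss.item() < best_loss:                         # keep the best state seen
            best_loss = loss.item()
            best_state = {k: v.clone() for k, v in head.state_dict().items()}
    head.load_state_dict(best_state)
    return head
```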

3.2 Greedy Selection for Fine-Tuning Configuration Selection

While applying CFT to Baseline, as the LRs are independent of each other, each LR can be fine-tuned with a different configuration to achieve higher performance. The configurations can differ in the optimization method (e.g., BCE loss and the GA introduced below), the method for handling the untypical labels that appear in the CheXpert dataset (see Sect. 4.1), the batch size, the learning rate, etc.

Hence, for each LR, we can additionally compare multiple fine-tuning configuration candidates and select the best one based on the results, referred to as greedy selection, as shown in Fig. 1 middle. For example, assume we apply CFT to a Baseline that has 5 LRs \(\texttt{h}_1, ..., \texttt{h}_5\) (5 categories) and compare BCE loss and GA, choosing the best configuration for each LR. A possible result is that \(\texttt{h}_1, \texttt{h}_4, \texttt{h}_5\) use BCE loss, while \(\texttt{h}_2, \texttt{h}_3\) use GA.
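The greedy selection can be sketched as a simple loop over categories and configuration candidates. The outline below is illustrative only, reusing the extract_lr/write_back_lr helpers from the earlier sketch; the callbacks for building \(\mathcal {D}_c\), fine-tuning, and validation scoring are hypothetical placeholders.

```python
import copy

def greedy_cft(model, configs, build_dc_fn, finetune_fn, evaluate_fn):
    """Greedy selection of a fine-tuning configuration for each LR (illustrative).

    configs:     list of configuration dicts, e.g. {"opt": "bce"} or {"opt": "ga"}
    finetune_fn: returns a fine-tuned copy of h_c under a given configuration
    evaluate_fn: validation score (e.g. AUC or AP) on category c for the current model
    """
    for c in range(model.h.out_features):
        original = extract_lr(model, c)
        feats, targets = build_dc_fn(model, c)
        best_lr, best_score = original, evaluate_fn(model, c)   # score of unmodified h_c
        for cfg in configs:
            candidate = finetune_fn(copy.deepcopy(original), feats, targets, cfg)
            write_back_lr(model, c, candidate)
            score = evaluate_fn(model, c)
            if score > best_score:
                best_lr, best_score = candidate, score
        write_back_lr(model, c, best_lr)   # keep the winner (possibly the original h_c)
    return model
```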

3.3 Fine-Tuning Logistic Regressions (LRs) Using Genetic Algorithm

During the experiments on the CheXpert dataset (whose performance metric is AUC; higher is better), we found that fine-tuning an LR using BCE loss sometimes unintentionally decreases AUC. A concrete example is shown in Fig. 2, which presents the learning curves of fine-tuning the LR of the "Atelectasis" category: in both the training and validation curves, minimizing BCE can decrease AUC. This is because minimizing BCE essentially optimizes classification accuracy [40], which does not necessarily yield the best possible AUC [40] or AP (average precision) [33], the popular metrics for image MLC.

Therefore, we propose using a Genetic Algorithm (GA) [29] to fine-tune each LR. GA is a global search algorithm inspired by the principles of evolutionary theory: in nature, individuals that are better adapted to the environment have higher chances of surviving and producing offspring, and this process repeats over generations until the best individual is found.

GA has shown its feasibility for training neural networks [10, 18, 30] and has several advantages over BCE loss: (1) GA is a direct search method [37] that can directly improve the performance computed by a metric, avoiding the potential performance decreases caused by minimizing BCE; and (2) BCE loss relies on backpropagation, which easily gets trapped in local optima and has difficulty escaping them to find better solutions [18], whereas GA runs multiple solutions simultaneously, which helps it escape from local optima [37].

4 Experimental Results and Discussion

We conducted extensive experiments on the CheXpert competition dataset (Sect. 4.1) and on partially labeled versions of the standard MS-COCO [28] MLC dataset (Sect. 4.2) to evaluate the effectiveness of the proposed methods.

4.1 The CheXpert Chest X-Ray Image MLC Competition Dataset

Dataset. CheXpert [21] is a large-scale 14-category chest X-ray image MLC competition dataset. The training set has 223,414 image samples, whose labels are automatically extracted from free-text reports. Each label is either positive, negative, unknown (blank in the original paper), or uncertain. Noteworthy, the uncertain labels in this dataset are untypical for partially labeled datasets and have a different semantic meaning from unknown labels: an uncertain label captures both the uncertainty in diagnosis and ambiguity in the report, while an unknown label implies no mention is found in the report. Hence, we do not simply treat the uncertain labels as unknown labels; we handle them in other ways instead, as described in the experimental settings below. The validation set has 234 image samples and the test set has 668 image samples; in both sets, every label is manually assigned as either positive or negative. The test set is private and reserved for the competition, so models must be submitted to the competition server for official evaluation on it. The competition leaderboard is available at https://stanfordmlgroup.github.io/competitions/chexpert/. The official performance metric is used, i.e., the mean AUC (mAUC) over 5 categories: Atelectasis, Cardiomegaly, Consolidation, Edema, and Pleural Effusion.

Baseline Training. Baseline is a DenseNet-121 [19] CNN with an input resolution of \(224^2\). The parameters trained on ImageNet [11] are used as the initial parameters. Baseline is trained on the training set for 10 epochs. We follow the previous state-of-the-art [31, 44] to treat unknown labels as negative (Negative mode) and uncertain labels as positive with label smoothing [31]. Images are rescaled to [0, 1]. We use the same data augmentation as in [6, 7]: horizontal flip, rotation \(\pm 20^{\circ }\), and scaling \(\pm 3\%\). BCE loss with batch size 32 and Adam (\(lr=1\times 10^{-4}\)) [25] is used to update the parameters. The checkpoint that achieves the highest validation mAUC is saved. Baseline achieves mAUC 89.6% on the validation set (as reported in Table 1), which is already very high for a single CNN; e.g., the single CNN of the \(\text {2}^{\text {nd}}\) place on the competition leaderboard achieves mAUC 89.4% [31].
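For clarity, a hedged sketch of how the BCE training targets could be built under this labeling policy. The raw-label encoding and the label-smoothing interval below are assumptions for illustration only; the exact smoothing scheme follows [31].

```python
import torch

def chexpert_targets(raw, smooth_lo=0.55, smooth_hi=0.85):
    """Map raw CheXpert label codes to BCE targets for Baseline training.

    Assumed encoding of `raw` (float tensor, one entry per sample and category):
        1.0 -> positive, 0.0 -> negative, -1.0 -> uncertain, nan -> unknown (blank).
    Unknown labels are treated as negative (Negative mode); uncertain labels are
    treated as positive with label smoothing as in [31]. The smoothing interval
    here is a placeholder assumption, not a value taken from this paper.
    """
    t = torch.zeros_like(raw)
    t[raw == 1.0] = 1.0                      # positive labels
    t[torch.isnan(raw)] = 0.0                # unknown -> negative (Negative mode)
    uncertain = raw == -1.0
    t[uncertain] = torch.empty(int(uncertain.sum())).uniform_(smooth_lo, smooth_hi)
    return t
```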

Ablation Study on CFT. We apply CFT to Baseline to improve its performance. The default BCE loss is used to fine-tune each LR, referred to as CFT-BCE. Besides, we study two variants of CFT-BCE:

  1. CFT-BCE-simu: All the LRs are fine-tuned simultaneously (i.e., the whole classification layer is fine-tuned), instead of fine-tuning each LR one by one. Partial-BCE loss [13] is used to enable Ignore mode.

  2. CFT-BCE-Negative: Each LR is fine-tuned with Negative mode instead of Ignore mode.

Full-batch gradient descent (\(lr=1\times 10^{-4}\)) is used to update parameters for stability. We treat the uncertain labels as unknown labels, so the uncertain labels are ignored in CFT. The number of epochs is 500.

Table 1 shows the results. CFT-BCE and its variants all improve the mAUC of Baseline. In particular, CFT-BCE achieves the highest improvement (mAUC +0.3%). CFT-BCE-simu is less effective (+0.1%), because one-by-one fine-tuning allows individually saving the best checkpoint for each LR, thus achieving better mAUC. CFT-BCE-Negative is also less effective (+0.1%), demonstrating that the use of Ignore mode can effectively reduce the performance decrease caused by the wrong labels of Negative mode during training.

Table 1. Ablation study on CFT, AUC%.
Table 2. Ablation study on GA, AUC%.

Ablation Study on GA. We study four different optimization methods for fine-tuning LRs to investigate the effectiveness of GA: (1) the default BCE loss used above (CFT-BCE), (2) GA (CFT-GA), (3) the loss proposed in [40], referred to as WMW loss (CFT-WMW), and (4) AUC margin loss (CFT-AUCM) [44]. WMW and AUC margin losses are particularly designed for AUC maximization.

For CFT-GA, we use the GA implementation of PyGAD [14]. The number of generations is 500. An individual encodes the parameters of the LR, with one position of the individual representing one parameter; decoding is the inverse operation of encoding. The number of individuals is 30, and all individuals are initialized by encoding the original parameters. The fitness function is the training mAUC. Roulette wheel selection is used to select 14 individuals as parents, and 10 of the parents are additionally kept as individuals in the next generation. 2-point crossover is used with a probability of 80%. The mutation probability is set to 2%; when a mutation occurs, \(1\%\) of the positions are mutated by adding random scalars drawn from \([-0.02, 0.02]\). At every generation, only the individual attaining the highest fitness score is validated, instead of all individuals, to reduce the risk of overfitting. The individual that achieves the highest validation mAUC is decoded and saved. For CFT-WMW, stochastic gradient descent (\(lr=1\times 10^{-3}, momentum=0.9\)) with batch size 32768 is used due to memory constraints. For CFT-AUCM, we follow the original paper [44] to use PESG (\(lr=1\times 10^{-2}, margin=1\)) [16] with full batch size.
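For readers who prefer code, the GA configuration described above can be approximated with a short, library-free sketch that directly maximizes training AUC for one LR (the paper uses the PyGAD implementation [14]; the per-generation validation of the best individual is omitted here for brevity, and the details are only an approximation of the stated settings).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ga_finetune_lr(w0, b0, feats, targets, generations=500, pop_size=30,
                   n_parents=14, n_keep=10, p_crossover=0.8,
                   p_mutation=0.02, frac_genes=0.01, mut_range=0.02, rng=None):
    """GA sketch for fine-tuning one LR by directly maximizing AUC on D_c.

    w0, b0:  original LR parameters, encoded as one flat individual (one gene each)
    feats:   cached features z of D_c, shape (N, Z); targets in {0, 1}, shape (N,)
    """
    rng = rng or np.random.default_rng(0)
    base = np.concatenate([np.asarray(w0).ravel(), [b0]])   # encode h_c's parameters
    pop = np.tile(base, (pop_size, 1))                       # all individuals start from h_c

    def fitness(ind):
        logits = feats @ ind[:-1] + ind[-1]
        return roc_auc_score(targets, logits)                # AUC is the direct objective

    best_ind, best_fit = base.copy(), fitness(base)
    for _ in range(generations):
        fit = np.array([fitness(ind) for ind in pop])
        if fit.max() > best_fit:                              # track the best individual seen
            best_fit, best_ind = fit.max(), pop[fit.argmax()].copy()
        probs = fit - fit.min() + 1e-9                        # roulette-wheel selection
        probs /= probs.sum()
        parents = pop[rng.choice(pop_size, size=n_parents, p=probs)]
        children = []
        while len(children) < pop_size - n_keep:              # fill the rest with offspring
            p1, p2 = parents[rng.integers(n_parents, size=2)]
            child = p1.copy()
            if rng.random() < p_crossover:                    # 2-point crossover
                i, j = np.sort(rng.integers(len(base), size=2))
                child[i:j] = p2[i:j]
            if rng.random() < p_mutation:                     # mutate a small fraction of genes
                idx = rng.choice(len(base), size=max(1, int(frac_genes * len(base))),
                                 replace=False)
                child[idx] += rng.uniform(-mut_range, mut_range, size=len(idx))
            children.append(child)
        pop = np.vstack([parents[:n_keep], np.array(children)])
    return best_ind[:-1].reshape(np.asarray(w0).shape), best_ind[-1]   # decode (w, b)
```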

Table 2 shows that all methods successfully improve the AUCs on all 5 categories. In particular, GA is the most effective (mAUC +1.9%), followed by AUCM loss (+1.6%) and WMW loss (+1.4%); BCE loss is the least effective (+0.3%).

Although WMW and AUCM losses are designed for AUC maximization, they are less effective than GA, probably because they rely on backpropagation, which easily gets trapped in local optima. In contrast, GA directly optimizes AUC and escapes local optima more easily. BCE loss is the least effective, as minimizing BCE can cause AUC drops, e.g., on the "Atelectasis" category (Fig. 2).

Fig. 2. Learning curves of using BCE loss to fine-tune the LR on Atelectasis. Minimizing BCE loss can decrease AUC.

Table 3. Greedy selection for exploiting uncertain labels, AUC%.

Greedy Selection for Exploiting Uncertain Labels. In the above ablation studies, treating uncertain labels as unknown may be sub-optimal, as previous studies on this dataset show that treating uncertain labels as positive tends to achieve higher performance [31]. Therefore, we compare three methods for handling uncertain labels with CFT-GA: treating them as unknown labels (same as in the ablation studies), as positive labels [21], and as negative labels [21].

Table 3 shows that different categories prefer different methods. Hence, we use the greedy selection to select the best method for each LR, eventually achieving mAUC 91.8%, which is +2.2% higher than Baseline mAUC 89.6%. In the following comparison section, we refer to this model as CFT-GA-Greedy.

Table 4. Comparison to other state-of-the-art approaches on the test set, AUC%.

Comparison to State-of-the-art Approaches. We compare CFT-GA-Greedy to other state-of-the-art approaches on the test set. Most of these approaches treat unknown labels as negative labels and hence can be considered strong Negative-mode baselines for the comparison. Table 4 shows the comparison.

Single Model. We submitted CFT-GA-Greedy to the competition server for official evaluation. It achieves mAUC 91.82%, which is, to the best of our knowledge, the highest single-model score in the leaderboard and the literature.

Ensemble. We build an ensemble composed of CFT-GA-Greedy and another 4 CNNs developed with our proposed methods, referred to as CFT-GA-Greedy-Ensemble. Similar to the \(2^\text {nd}\) place on the competition leaderboard [31], we use test-time augmentation [36] for more robust predictions: scale \(\pm 5\%\), rotate \(\pm 5^{\circ }\), translate \(\pm 5\%\). Since the competition was suddenly closed, our ensemble could not be submitted for official evaluation. After the test set was released for download, we evaluated the ensemble on a local machine. It achieves mAUC 93.33%, which, to the best of our knowledge, surpasses the best result in the leaderboard and the literature.
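A minimal sketch of such a test-time-augmentation ensemble, assuming each member model outputs sigmoid scores in \([0,1]^C\) and using torchvision's RandomAffine to approximate the stated augmentation ranges (the number of augmented passes and all other details are illustrative choices, not taken from the paper):

```python
import torch
import torchvision.transforms as T

@torch.no_grad()
def tta_ensemble_predict(models, images, n_aug=5):
    """Average sigmoid outputs over test-time augmentations and ensemble members.

    `images` is a batched tensor (B, 3, H, W); each call to `augment` applies one
    random affine transform to the whole batch, a deliberate simplification.
    """
    augment = T.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.95, 1.05))
    preds = []
    for model in models:
        model.eval()
        for _ in range(n_aug):
            preds.append(model(augment(images)))   # each model outputs [0, 1]^C scores
    return torch.stack(preds).mean(dim=0)          # final ensemble prediction
```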

4.2 Partially Labeled Versions of MS-COCO

Dataset. MS-COCO [28] (2014 split) is a standard MLC dataset comprising 80 categories. The training and validation sets consist of around 80k and 40k image samples, respectively. We follow prior work on MS-COCO (e.g., [34]) and use mean AP (mAP) as the performance metric.

As the training data are fully labeled, different partial-label schemes can be simulated by dropping some labels. In particular, we study our methods with known-label proportions of 10%, 20%, ..., 90%; to simulate these schemes, we randomly drop 90%, 80%, ..., 10% of the labels, respectively.
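A minimal sketch of this simulation, assuming the labels are stored as an \(N\times C\) tensor over \(\{-1, 1\}\) and unknown labels are encoded as 0 (as in the notation of Sect. 3):

```python
import torch

def drop_labels(labels, known_frac, seed=0):
    """Simulate partial labels on a fully labeled set (labels in {-1, 1}^{N x C}).

    Keeps `known_frac` of the labels (e.g. 0.1 for the 10% scheme) and sets the
    rest to 0 (unknown), uniformly at random over all (sample, category) pairs.
    """
    g = torch.Generator().manual_seed(seed)
    mask = torch.rand(labels.shape, generator=g) < known_frac   # True = keep this label
    partial = labels.clone()
    partial[~mask] = 0
    return partial
```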

Table 5. Results on partially labeled versions of MS-COCO dataset. In mAP %. “Average” column is the average mAP over label proportions 10% to 90%. (Bolded is the best, underlined is the \(2^\text {nd}\) best)

Baseline Training. We follow most of the settings of [34] to train Baseline, as they achieved a state-of-the-art CNN on the original (i.e., fully labeled) MS-COCO. Baseline is a TResNet-L [35] with an input resolution of \(448^2\). The parameters trained on ImageNet are used as the initial parameters. Negative mode is used to handle the unknown labels. We use batch size 8, asymmetric loss [34], and Adam (\(lr=2\times 10^{-4}\)) to update the parameters. We use AutoAugment [9] with the pretrained ImageNet policy as the data augmentation method. The input images are normalized to zero mean and unit variance. The checkpoint that achieves the highest validation mAP is saved. The performance of Baseline under different label proportions is reported in Table 5.

Ablation Study on CFT. We apply CFT to Baseline to improve its performance. The default BCE loss is used to fine-tune each LR, referred to as CFT-BCE. Similar to the experiments on CheXpert, we also study the two variants of CFT-BCE: CFT-BCE-simu and CFT-BCE-Negative. Full-batch gradient descent (\(lr=1\times 10^{-2}, momentum=0.9\)) is used and the number of epochs is 5000.

CFT-BCE improves the average mAP by 2.74%, CFT-BCE-simu by 0.36%, and CFT-BCE-Negative by 1.27%. Both variants are less effective than CFT-BCE, demonstrating the effectiveness of one-by-one fine-tuning and Ignore mode.

Noteworthy, CFT-BCE-Negative does not use Ignore mode. Although it is less effective than using Ignore mode, it can still improve the average mAP, implying that this improvement likely comes from somewhere other than reducing the performance decrease caused by the wrong labels of Negative mode during training. Therefore, CFT may also be able to improve models trained with fully labeled data, which requires further investigation.

Ablation Study on GA. We compare GA to the default BCE loss (used above) for fine-tuning each LR. The number of generations is 2000 and the population size is 50. All individuals of the initial population are encoded from the original parameters. The best individual of the current generation is kept as an individual of the next generation. Parents are selected using roulette wheel selection. During crossover, 20% of the positions of two parents are randomly swapped to produce offspring. Each offspring has a 50% chance of being mutated by adding a random scalar drawn from \([-0.001, 0.001]\) to each position.

GA improves the average mAP by 1.68%. However, it is generally less effective than BCE loss (2.74%). The key reasons may be that (1) minimizing BCE does not necessarily lead to AP drops, and (2) BCE loss relies on backpropagation, which is generally more efficient than GA.

Greedy Selection. We use greedy selection to choose the better optimization method between BCE loss and GA for each LR, referred to as CFT-Greedy. CFT-Greedy improves the average mAP by 2.77%, which is 0.03% higher than CFT-BCE. This implies that the greedy selection chose GA for the fine-tuning of only a small proportion of the LRs.

5 Conclusion

In this paper, we propose a new post-training method called CFT, which fine-tunes, one by one, the LRs of a model trained with Negative mode to further improve its classification performance on each category independently. Two optimization methods (BCE loss and GA) are tested for fine-tuning the LRs. The effectiveness is evaluated on the CheXpert competition dataset and on partially labeled versions of the standard MS-COCO MLC dataset. Notably, CFT achieves state-of-the-art results on the CheXpert dataset (single-model mAUC 91.82% and ensemble mAUC 93.33% on the test set).