1 Introduction

The task of concurrently detecting and categorizing objects has been extensively studied in classic computer vision [5, 12]. In medical image computing, numerous approaches have been proposed to predict lesion locations and gradings, most of them in a supervised manner utilizing manual annotations. However, when adopting state-of-the-art object detectors for end-to-end lesion grading, one has to account for an inherent difference in the data: The grading of lesions denotes a subjective discretization of naturally continuous and ordered features (such as scale or intensity) to semantic categories with clinical meaning (e.g., BI-RADS score, Gleason score [14], PI-RADS score, TNM staging). This is in contrast to typical tasks on natural images, where categories can be described as an unordered set (no natural ordinal relation exists between dogs and cars). Hence, current object detectors phrase the categorization as a classification task and are trained using the cross-entropy loss, not considering the continuous ordinal relation between classes (see Sect. 2.1).

In this paper, we account for the ordinal information in lesion appearance and derived categories, aiming to improve model performance. To this end, we propose Reg R-CNN, which replaces the classification model of Mask R-CNN [5], a state-of-the-art object detector, with a regression model. Regression models utilize distance metrics, i.e., models are trained directly on the underlying continuous scale, which has the following major benefit in the setting of lesion grading on medical images:

Medical data sets often exhibit high ambiguity that is reflected in the variability of the human annotations. Under the assumption that class confusions follow a distribution around the underlying ground truth, distance metrics used in regression such as the L1-distance are more tolerant to mild deviation from the target value as opposed to the categorical cross entropy which penalizes all off-target predictions in equal measure [4].

We empirically show the superiority of Reg R-CNN on a public data set with 1026 patients and a series of toy experiments with code made publicly available.

2 Methods

2.1 Regression vs. Classification Training

In order to see why we expect the training of regression models to be more robust to label noise than classification models for the case when target classes lie on a continuous scale, let us first revisit the objective commonly minimized by classifiers. This objective is the cross entropy (CE), defined as

$$\begin{aligned} H\left( \mathbf p , \mathbf q ; \mathbf X \right) = - \sum _{j} p_j(\mathbf X ) \log q_j(\mathbf X ) \end{aligned}$$
(1)

between a target distribution \(\mathbf p (\mathbf X )\) over discrete labels \(j \in C\) and the predicted distribution \(\mathbf q (\mathbf X )\) given data \(\mathbf X \). For mutually exclusive classes, the target distribution is given by a delta distribution \(\mathbf p (\mathbf X )=\{\delta _{ij}\}_{j\in C}\).

To produce a prediction \(\mathbf q (\mathbf X )\), the network’s logits \(\mathbf z (\mathbf X )\) are squashed by means of a softmax function:

$$\begin{aligned} q_j(\mathbf X ) = \frac{e^{z_j(\mathbf X )}}{\sum _{k\in C} e^{z_k(\mathbf X )}}, \end{aligned}$$
(2)

which, plugged into Eq. 1 and given the target class i, leads to the loss term

$$\begin{aligned} H = \mathcal{L}_{CE}(\mathbf p = \{\delta _{ij}\}_{j\in C}, \mathbf q ;\mathbf X ) = -z_i + \log \textstyle \sum _k e^{z_k}. \end{aligned}$$
(3)

From Eq. 3 it is apparent that the standard CE loss treats labels as an unordered bag of targets, where all off-target classes (\(j \ne i\)) are penalized in equal measure, regardless of their proximity to the target class i. Distance metrics, on the other hand, take the distance between prediction and target into account, so the loss scales with the deviation of the prediction from the target. By penalizing mild discrepancies more leniently, they better accommodate noise from potentially conflicting labels in settings where the target labels lie on a continuum.

In the range of experiments below, we compare classification against regression setups, for which we employ the smooth L1 loss [6], given by

$$\begin{aligned} \mathcal{L}_{SL1}(p, t) = \begin{cases} 0.5\,(p-t)^2 &amp; \text{if } |p-t| < 1 \\ |p-t| - 0.5 &amp; \text{otherwise} \end{cases} \end{aligned}$$
(4)

for predicted value p and target value t. Other works have investigated adaptations of the CE loss to account for noisy labels in classification tasks, e.g., [15, 17]. Our approach is complementary to those works as it exploits label continua on medical images.
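
To make this difference concrete, the following minimal sketch (PyTorch, purely illustrative and not part of our implementation) contrasts the two penalties on a five-bin ordinal scale: the CE loss of a confident prediction is identical whether it misses the target by one bin or by four, whereas the smooth L1 loss grows with the distance to the target.

```python
import torch
import torch.nn.functional as F

# True category on an ordinal scale with bins 0..4.
target_bin = torch.tensor([0])

# A classifier confidently predicting the adjacent bin 1 vs. the distant bin 4.
logits_adjacent = torch.tensor([[0., 5., 0., 0., 0.]])
logits_distant = torch.tensor([[0., 0., 0., 0., 5.]])
print(F.cross_entropy(logits_adjacent, target_bin))  # ~5.03
print(F.cross_entropy(logits_distant, target_bin))   # ~5.03 -> same penalty

# A regressor predicting 1.0 vs. 4.0 for target 0.0.
target_val = torch.tensor([0.])
print(F.smooth_l1_loss(torch.tensor([1.]), target_val))  # 0.5
print(F.smooth_l1_loss(torch.tensor([4.]), target_val))  # 3.5 -> scales with distance
```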

2.2 Reg R-CNN and Baseline

The proposed Reg R-CNN architecture is based on Mask R-CNN [5], a state-of-the-art two-stage detector. In Mask R-CNN, first, objects are discriminated from background irrespective of class, accompanied by bounding-box regression to generate region proposals of variable sizes. Second, proposals are resampled to a fixed-sized grid and fed through three head networks: A classifier for categorization, a second bounding-box regressor for refinement of coordinates, and a fully convolutional head producing output segmentations (the latter are not further used in this study except for the additional pixel-wise loss during training). Reg R-CNN (see Fig. 1) simply replaces the classification head by a regression head, which is trained with the smooth L1 loss instead of the cross-entropy loss (see Sect. 2.1).

For the final filtering of output predictions, non-maximum suppression (NMS) is performed based on detection-confidence scores. In Mask R-CNN, these are provided by the classification head. Since the regression head does not produce confidences, we use the objectness scores from the first stage instead.

In this study, we compare Reg R-CNN against Mask R-CNN as the classification counterpart of our approach. Only minor changes are made with respect to the original publication [5]: The number of feature maps in the region proposal network is lowered to 64 to account for GPU memory constraints. The pool size of 3D RoIAlign (a 3D re-implementation of the resampling method used to create fixed-sized proposals) is set to (7, 7, 3) for the classification head and (14, 14, 5) for the mask head. The matching Intersection over Union (IoU) for positive proposals is lowered to 0.3. Objectness scores are used for the final NMS to reflect the desired disentanglement of detection and categorization tasks.

Note that all changes apply to Reg R-CNN as well, such that the only difference between the models is the exchange of the classification head with a regression head.
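
In code, the swap amounts to replacing a classifier head that outputs per-class logits with a head that outputs a single continuous score per RoI. The sketch below is a hypothetical 3D regression head matching the pool size stated above; layer names and widths are illustrative and not taken from our implementation.

```python
import torch.nn as nn

class ScoreRegressor(nn.Module):
    """Hypothetical 3D regression head: one continuous malignancy score per RoI."""

    def __init__(self, in_channels=64, pool_shape=(7, 7, 3)):
        super().__init__()
        flat = in_channels * pool_shape[0] * pool_shape[1] * pool_shape[2]
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 256), nn.ReLU(),
            nn.Linear(256, 1),  # single regression output instead of class logits
        )

    def forward(self, roi_feats):              # roi_feats: (n_rois, C, 7, 7, 3)
        return self.fc(roi_feats).squeeze(-1)  # (n_rois,) continuous scores
```

The head is trained with the smooth L1 loss of Eq. 4 against the continuous target scores, while the remaining losses of the detector stay unchanged.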

Fig. 1.

Reg R-CNN for joint detection and grading of objects. The architecture is closely related to Mask R-CNN [5], where grading is done with a classification head instead of the displayed “Score Regressor” head network. FPN denotes the feature pyramid network [11], RPN denotes the region proposal network and RoIAlign is the operation which resamples object proposals to a fixed-sized grid before categorization.

2.3 Evaluation

Comparing the performance of regression models to that of classification models requires additional considerations, since both are trained alongside an upstream detection task.

In order to compare continuous regression and discrete classification outputs, we bin the continuous regression output after training, such that bin centers match the discrete classification targets.
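
A minimal sketch of this post-hoc binning, assuming five integer classification targets with bin edges placed halfway between them (values are illustrative):

```python
import numpy as np

bin_centers = np.array([1, 2, 3, 4, 5])              # discrete classification targets
edges = (bin_centers[:-1] + bin_centers[1:]) / 2.0   # [1.5, 2.5, 3.5, 4.5]

regression_output = np.array([1.2, 2.6, 4.9])        # continuous model predictions
binned = bin_centers[np.digitize(regression_output, edges)]
print(binned)  # [1 3 5]
```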

Furthermore, the joint task of object detection and categorization is commonly evaluated using average precision (AP) [2]. However, AP requires per-category confidence scores, which are, as mentioned before, not provided by regression outputs. Instead, we borrow a metric commonly used in viewpoint estimation, the Average Viewpoint Precision (AVP) [16]. Based on AVP, we phrase the lesion scoring as an additional task on top of foreground vs. background object detection: In order for a box prediction to be considered a true positive, it needs to match the ground-truth box with an IoU > 0.1, and additionally the malignancy prediction score is required to lie in the correct category bin. This way, AVP simultaneously measures both the detection and malignancy-scoring performance of the models. We additionally disentangle the task performances and separately report the AP of foreground vs. background detection (this poses an upper bound on AVP) and the bin accuracy. The latter is determined by selecting only true positive predictions according to the detection metric and counting malignancy-score matches with the target bin.
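
The matching rule behind \(\mathrm {AVP_{10}}\) can be condensed as follows (a hypothetical helper for illustration, not part of the evaluation code):

```python
def is_true_positive(iou, pred_bin, gt_bin, iou_thresh=0.1):
    # A prediction counts only if it both localizes the lesion (IoU)
    # and grades it correctly (binned malignancy score).
    return iou > iou_thresh and pred_bin == gt_bin
```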

Fig. 2.

(a) A confusion-matrix-style display of annotator dissent in the LIDC data set. Rows represent the binned mean ratings of lesions (in place of the true class in a standard confusion matrix), columns the ratings of the corresponding single annotators. “MS” means malignancy score. The matrix is row-wise normalized; hence, cell values indicate the distribution of lesion ratings within a bin. (b)–(d) Example slice from the LIDC data set showing the GT, Reg R-CNN, and Mask R-CNN predictions separately. In the GT, “sa. MS” denotes the single-annotator grades (grade 0 means no finding) and “agg. MS” their mean. In the predictions, “FG” means foreground confidence (objectness score) and “MS” denotes the predicted malignancy score. Mask R-CNN MS can be non-integer due to Weighted Box Clustering [7]. Color indicates the bin.

3 Experiments

3.1 Utilized Data Sets

Lung CT Data Set. The utilized LIDC-IDRI data set consists of 1026 patients with annotations from four medical experts each [1]. Having multiple gradings from distinct annotators at one’s disposal is a rare exception for medical images and allows investigating the exhibited label noise [9].

Full agreement, which we define as all raters assigning the same malignancy label to all lesions (RoIs) in a patient, is observed on a mere 163 patients (this includes patients void of findings by all raters). This corresponds to rater disagreement with respect to malignancy scoring on \(84\%\) of the patients. On a lesion level (RoI-wise), the data set comprises 2631 lesions when considering all lesions with a positive label by at least one rater. This number drops to 1834, 1333, or 821 when requiring 2, 3, or 4 positive labels, respectively. This shows that the data set’s labelling is ambiguous with respect to both whether a lesion is present and how a prospective lesion should be graded. The first type of ambiguity bears on the detection head’s performance, while the second influences the network’s classification or, respectively, regression head.

For the evaluation of grading performance, the following malignancy statistics include only patients with at least one finding. Among those, we count 99 lesions (\(3.8\%\) of all lesions) with full rater agreement, leaving disagreement on 2532 (or \(96.2\%\)). The standard deviation across the 4 graders, averaged over all lesions, amounts to 1.05 malignancy-score values (MS). In Fig. 2(a), we show how the single graders’ malignancy ratings differ given the binned mean rating. The figure reveals significant label confusion across adjacent labels and even beyond. Figures 2(b)–(d) display example Reg and Mask R-CNN predictions next to the corresponding ground truth.

In order to investigate the models’ performance under label noise, we randomly sample a malignancy score (MS) for a given lesion from the 4 given gradings at each training iteration. At test time, however, we employ the lesions’ mean malignancy score as the ground-truth label, which allows evaluation against a ground truth with reduced noise.
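
The following sketch condenses this sampling scheme (the gradings are hypothetical, not actual LIDC values):

```python
import numpy as np

rng = np.random.default_rng(0)
rater_scores = np.array([2, 3, 3, 4])    # four expert gradings of one lesion

train_target = rng.choice(rater_scores)  # redrawn at every training iteration
test_target = rater_scores.mean()        # 3.0: lower-noise ground truth at test time
```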

Toy Data Set. To analyze the performance of Reg R-CNN vs. Mask R-CNN on an artificial data set with label noise on a continuous scale, we designed a set of 3D toy images. The associated task is the joint detection and categorization of cylinders, where five categories are distinguished as cylinders of five different radii. In order to simulate label confusion, Gaussian noise is added to the isotropic target radii during training, sampled with standard deviation \(\sigma =\nicefrac {r}{6}\) around object radius r, as depicted in Figs. 3(c) and (d). This causes targets (especially of large-radius objects) to be shifted into wrong, yet mostly adjacent target bins. Figure 3(a) shows that these ambiguities are imprinted on the images as a belt of reduced intensity with width \(2\sigma \) around the actual radius. At test time, model predictions are evaluated against the exact target radii without noise. The data set consists of 1.5k randomly generated samples for training and validation, as well as a hold-out test set of 1k images.
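
A sketch of this label-noise generation (the concrete radii are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
true_radii = np.array([4., 8., 12., 16., 20.])  # hypothetical radii for r1-r5

# Annotated radius r_a drawn around the true radius r with sigma = r / 6,
# so larger cylinders are more likely to spill into neighboring bins.
annotated_radii = rng.normal(loc=true_radii, scale=true_radii / 6.0)
```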

3.2 Training and Inference Setup

Both the LIDC and the toy data set consist of volumetric images. In this study, we evaluate models both in 3D and in 2D (slice-wise processing). For the sake of comparability, all methods are implemented in a single framework and run with identical hyperparameters. Networks are trained on patch crops of size \(160\times 160\times 96\) (LIDC) and \(320\times 320\times 8\) (toy); oversampling of foreground regions is applied. Class imbalances in object-level classification losses are accounted for by stochastically mining the hardest negative object candidates according to softmax probability.

On LIDC, models are trained for 130 epochs, each composed of 200 batches with size 8 (20) in 3D (2D), using the Adam optimizer [8] with default settings at a learning rate of \(10^{-4}\). Training is performed as a five-fold cross validation (splits: train \(60\%\)/val \(20\%\)/test \(20\%\)). At test time, we ensemble the four best-performing models according to validation metrics over four test-time views (three mirroring augmentations) in each fold. Aggregation of box predictions from ensemble members is done via clustering and weighted averaging of scores and coordinates. Predictions from 2D models are consolidated along the z-axis by means of an adaptation of NMS and evaluated against the 3D ground truth.
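
For reference, the shared setup can be condensed into the following illustrative configuration (a plain dictionary for this sketch, not our actual configuration file):

```python
cfg = dict(
    patch_size_lidc=(160, 160, 96),
    patch_size_toy=(320, 320, 8),
    n_epochs=130,
    batches_per_epoch=200,
    batch_size={'3D': 8, '2D': 20},
    optimizer='Adam',            # default settings
    learning_rate=1e-4,
    cv_folds=5,                  # train 60% / val 20% / test 20%
    test_time_views=4,           # identity + three mirroring augmentations
    ensemble_size=4,             # best checkpoints per fold
)
```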

Fig. 3.

(a) Cylinders (2D projections) of all five categories (r1-r5) in the toy experiment. (b) Exact GT. (c) Examples of a noisy GT for each category (r1-r5). \(r_a\) indicates the annotated radius (target regression value). (d) Gaussian sampling distributions used to generate the noisy GT. Green vertical lines depict the exact ground-truth values, while blue lines are the corresponding label-noise distributions. Green rectangles are the bins (borders enlarged for illustration) used for training of the classifier as well as for evaluation of both methods. Note that distributions reach into neighboring bins leading to label confusions. (Color figure online)

3.3 Results and Discussion

Results are shown in Table 1. In addition to the fold means of the metrics, we report the corresponding standard deviations. On LIDC, Reg R-CNN outperforms Mask R-CNN on both input dimensions and all three considered metrics. On the toy data set, Reg R-CNN shows superior performance in \(\mathrm {AVP_{10}}\) and Bin Accuracy. \(\mathrm {AP_{10}}\) reaches \(100\%\) for both models, indicating that the detection task is solved entirely, i.e., the object-grading task has been isolated successfully (hence, results for \(\mathrm {AVP_{10}}\) converge towards the Bin Accuracy). All experiments demonstrate the superiority of distance losses in the supervision of models performing continuous and ordered grading under noisy labels. Interestingly, there is a marked increase in performance for both setups when running in 3D as opposed to 2D, suggesting that additional 3D context is generally beneficial for the task.

Table 1. Results for LIDC and the toy data set. \(\mathrm {AVP_{10}}\) measures joint detection and categorization performance, while \(\mathrm {AP_{10}}\) measures the disentangled detection performance and Bin Accuracy shows categorization performance (conditioned on detection, see Sect. 2.3)

4 Conclusion

Simultaneously detecting and grading objects is a common and clinically highly relevant task in medical image analysis. As opposed to natural images, where object categories are mostly well defined, the categorizations of interest for clinically relevant findings commonly leave room for interpretation. This ambiguity can bear on machine-learning models in the form of noisy labels, which may hamper the performance of classification models. Clinical label categories, however, often reside on a continuous and ordered scale, suggesting that label confusions are likely more frequent between adjacent categories.

For this case, we show that both lesion-detection and malignancy-grading performance can be improved over a state-of-the-art detection model by simply trading its classification head for a regression head and altering the loss accordingly. We document the success of the ensuing model, Reg R-CNN, on a large lung CT data set and on a toy data set with induced artificial ambiguity. We attribute the edge in performance to the loss formulation of the regression task, which naturally accounts for the continuous relation between labels and is therefore less prone to suffer from conflicting gradients caused by noisy labels.

5 Outlook

As Eq. 4 shows, we employ a metric approach to ordinal data. In general, this is not hazard-free as model performance may suffer from the imposed metric if the scale actually is non-metric [10]. In other words, our approach implicitly assumes the grading scale has sufficiently metric-like properties. To address this limitation, we plan to study alternative non-metric approaches in future work [3, 13].