Introduction

There are over 550,000 head and neck (HaN) cancer patients worldwide every year, with around 300,000 deaths [1]. Radiotherapy is one of the essential treatments for these patients. In radiotherapy, the HaN organs must be delineated accurately to control the radiation dose distribution and lessen the damage to normal tissues and organs. Manual delineation of HaN organs by professional doctors is inefficient, and the segmentation results depend on their experience. Traditional segmentation algorithms, in turn, struggle to segment multiple organs simultaneously.

In this paper, we propose a novel model based on 3D U-Net [2] to improve the accuracy of multi-scale organ segmentation. It introduces squeeze-and-attention (SA) [3] blocks into the residual blocks to gather multi-scale context information and group non-local voxels belonging to the same organ. It also employs down-sampling only once and introduces a receptive field block (RFB) [4] to balance the performance on large-sized and small-sized organs. Furthermore, we train the model with a marginal loss and an exclusion loss in a partially supervised learning mode [5], which exploits the prior knowledge among voxels to improve generalization.

Related work

Methods based on 3D CNNs implement end-to-end automatic segmentation of HaN organs at risk (OARs) [6]. A general 3D CNN, however, cannot easily resolve the imbalanced segmentation accuracy caused by the large volume differences among OARs, and researchers have improved 3D CNNs in a variety of ways to address this problem [7]. A CNN with a self-channel-and-spatial-attention mechanism adaptively forces the network to emphasize meaningful features and weaken irrelevant ones [8]. AnatomyNet [9] introduces 3D squeeze-and-excitation (SE) [10] blocks into 3D U-Net to enhance feature extraction and combines the Dice loss [11] and focal loss [12] to reduce the highly imbalanced segmentation accuracy caused by the varied sizes of OARs. Cascades of multiple models or structures are widely used in multi-target segmentation tasks; for example, a cascade of two 3D U-Net models first locates the OARs in the CT image and then performs fine segmentation to obtain better results [13]. FocusNet [14] simulates the doctors' delineation process by combining a main segmentation network with small-organ localization and small-organ segmentation sub-networks. Prior knowledge of OAR shape is sometimes applied to improve the accuracy of multi-scale organ segmentation: FocusNetv2 [15] adds an adversarial shape-constraint block to regularize the estimated mask, making the segmentation results consistent with the shapes of small-sized organs. Imaging characteristics of multi-modality images are also exploited to improve accuracy: Liu et al. [16] use synthetic magnetic resonance (MR) images to aid the training of a dual pyramid network (DPN) [17], and Dai et al. [18] utilize the complementary information of cone-beam CT (CBCT) and MR images to improve segmentation performance.

Data

Dataset scale is vital for image segmentation based on supervised learning; therefore, we use three public datasets to train our model. The Public Domain Database for Computational Anatomy (PDDCA) dataset [19] contains 25 training samples, an additional 8 training samples added after the MICCAI 2015 Head and Neck Auto Segmentation Challenge (MICCAI 2015), 10 offsite test samples, and 5 onsite test samples. It provides whole-volume HaN CT images with binary labels for the brainstem, mandible, chiasm, optic nerve left (Optic. L), optic nerve right (Optic. R), parotid gland left (Paro. L), parotid gland right (Paro. R), submandibular gland left (Subm. L), and submandibular gland right (Subm. R). Following the data processing method provided by AnatomyNet, we expanded the training dataset with the publicly available Head-Neck Cetuximab dataset [20] (46 samples) and the dataset collected from institutions in Québec, Canada [21] (177 samples). In total, the training dataset includes 261 samples, and the test dataset comprises the 10 offsite test samples.

The PDDCA test dataset contains 9 annotated labels, but the expanded training dataset does not include all of them. To maintain dataset consistency, we cropped the original CT images to the same size, retaining the essential organ information, and resampled them to a voxel spacing of 3 mm \(\times \) 1.2 mm \(\times \) 1.2 mm.
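The paper does not specify the toolkit used for cropping and resampling; the snippet below is a minimal sketch of the resampling step with SimpleITK, assuming the crop has already been applied and that the 3 mm spacing corresponds to the slice (z) direction.

```python
import SimpleITK as sitk

def resample_ct(image, new_spacing=(1.2, 1.2, 3.0), is_label=False):
    """Resample a CT volume (or a binary label mask) to the target spacing.

    SimpleITK spacing order is (x, y, z); (1.2, 1.2, 3.0) mm matches the
    3 mm x 1.2 mm x 1.2 mm spacing stated in the text, with 3 mm assumed
    to be the slice thickness.
    """
    orig_spacing = image.GetSpacing()
    orig_size = image.GetSize()
    new_size = [int(round(sz * sp / nsp))
                for sz, sp, nsp in zip(orig_size, orig_spacing, new_spacing)]
    resampler = sitk.ResampleImageFilter()
    resampler.SetOutputSpacing(new_spacing)
    resampler.SetSize(new_size)
    resampler.SetOutputOrigin(image.GetOrigin())
    resampler.SetOutputDirection(image.GetDirection())
    # Nearest neighbour keeps binary labels intact; linear is adequate for CT intensities.
    resampler.SetInterpolator(sitk.sitkNearestNeighbor if is_label else sitk.sitkLinear)
    return resampler.Execute(image)
```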

Method

Squeeze-and-attention block

Semantic segmentation with CNNs is usually formulated as voxel-by-voxel prediction. We introduce the pixel-grouping mechanism implemented by the SA block to improve segmentation performance; Figure 1 illustrates its structure. In the SA block, an average pooling layer gathers non-local spatial attention over the feature maps by increasing the receptive field, encoding global features, and grouping non-local voxels from the same organ. The pooling layer, whose kernel size and stride are both 2, reduces the feature map to 1/8 of its original volume (halving each spatial dimension). Two successive convolution blocks with kernel size 3 and stride 1 then extract features, an up-sampling layer recovers the original feature-map size, and a residual connection fuses the local and non-local information.

Fig. 1 The architecture of the SA block
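A minimal PyTorch sketch of the SA block as described above is given below. The attention path pools with kernel size and stride 2 (1/8 of the original volume), applies two 3×3×3 convolutions, up-samples, and is fused with the local path through a residual connection; the normalization layers, activations, and the exact fusion operation are assumptions rather than the authors' published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SABlock3D(nn.Module):
    """Sketch of the 3D squeeze-and-attention block of Fig. 1."""

    def __init__(self, channels):
        super().__init__()
        # Local path: an ordinary convolution block (assumed configuration).
        self.local = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(channels),
            nn.ReLU(inplace=True),
        )
        # Attention path: average pooling (kernel 2, stride 2) followed by
        # two successive 3x3x3 convolutions, as described in the text.
        self.attention = nn.Sequential(
            nn.AvgPool3d(kernel_size=2, stride=2),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        local = self.local(x)
        attn = self.attention(x)
        # Up-sampling recovers the original feature-map size.
        attn = F.interpolate(attn, size=x.shape[2:], mode='trilinear',
                             align_corners=False)
        # Residual connection fuses local and non-local information
        # (the exact fusion used by the authors is not specified here).
        return local * torch.sigmoid(attn) + x
```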

Receptive field block

The down-sampling operation increases the receptive field of the model but loses some details. Therefore, we employ down-sampling only once and introduce the RFB [4], which enlarges the receptive field and balances segmentation accuracy between large-sized and small-sized organs. Figure 2 illustrates the structure of the RFB, which is based on the Inception block [22] and improved with atrous convolution layers [23] to extract multi-scale features. Three branches process the input feature map: each comprises a convolution with a kernel size of \(1\times 1\times 1\), \(3\times 3\times 3\), or \(5\times 5\times 5\), followed by an atrous convolution with a rate of 1, 3, or 5, respectively, to increase the receptive field. A shortcut connection then fuses the concatenated branch features with the input features.

Fig. 2 The architecture of the RFB; the number of channels is marked above each block
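A matching sketch of the RFB follows: each branch pairs a 1×1×1, 3×3×3, or 5×5×5 convolution with an atrous convolution of rate 1, 3, or 5, and a shortcut connection fuses the concatenated branch features with the input. The per-branch channel widths and the 1×1×1 fusion convolution are assumptions; Fig. 2 gives the actual channel numbers.

```python
import torch
import torch.nn as nn

class RFB3D(nn.Module):
    """Sketch of the 3D receptive field block of Fig. 2."""

    def __init__(self, channels):
        super().__init__()

        def branch(kernel, rate):
            return nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=kernel, padding=kernel // 2),
                nn.ReLU(inplace=True),
                # Atrous convolution enlarges the receptive field without down-sampling.
                nn.Conv3d(channels, channels, kernel_size=3, padding=rate, dilation=rate),
                nn.ReLU(inplace=True),
            )

        self.branches = nn.ModuleList([branch(1, 1), branch(3, 3), branch(5, 5)])
        self.fuse = nn.Conv3d(3 * channels, channels, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        # Shortcut connection fuses the multi-scale features with the input.
        return self.fuse(feats) + x
```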

Network architecture

Based on 3D U-Net, we propose an HaN segmentation network, illustrated in Fig. 3, where X and \(X^{\prime }\) denote the input image and the segmentation result, respectively. Its encoder and decoder contain three SA blocks that classify and group voxels belonging to the same organ. Down-sampling increases the receptive field but loses details of the feature map and reduces the accuracy on small-sized organs; therefore, we employ down-sampling and up-sampling only once in the model. In addition, we introduce an RFB to learn multi-scale features, which increases the receptive field, balances segmentation accuracy between large-sized and small-sized organs, and improves the accuracy on small-sized organs.

Fig. 3 The architecture of the model; the number of channels is marked above or below each block
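For orientation, the skeleton below assembles the blocks above into the single down-/up-sampling topology described in the text, reusing the SABlock3D and RFB3D sketches. The channel widths, the placement of the three SA blocks, the position of the RFB, and the skip-connection fusion are assumptions, not the authors' exact configuration (Fig. 3 gives that).

```python
import torch
import torch.nn as nn

class HaNSegNet(nn.Module):
    """High-level sketch of the proposed network; SABlock3D and RFB3D are
    the sketches from the previous sections."""

    def __init__(self, in_ch=1, num_classes=10, base_ch=32):
        super().__init__()
        self.stem = nn.Conv3d(in_ch, base_ch, kernel_size=3, padding=1)
        self.enc1 = SABlock3D(base_ch)
        # The single down-sampling step (a strided convolution is assumed here).
        self.down = nn.Conv3d(base_ch, 2 * base_ch, kernel_size=2, stride=2)
        self.enc2 = SABlock3D(2 * base_ch)
        self.rfb = RFB3D(2 * base_ch)
        self.up = nn.ConvTranspose3d(2 * base_ch, base_ch, kernel_size=2, stride=2)
        self.dec = SABlock3D(base_ch)
        self.head = nn.Conv3d(base_ch, num_classes, kernel_size=1)  # background + 9 OARs

    def forward(self, x):
        # Assumes even spatial dimensions so that up-sampling restores the input size.
        e1 = self.enc1(self.stem(x))
        e2 = self.rfb(self.enc2(self.down(e1)))
        d = self.dec(self.up(e2) + e1)              # skip connection fuses encoder features
        return torch.softmax(self.head(d), dim=1)   # voxel-wise probabilities P
```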

Loss function

The MICCAI 2015 dataset contains 9 annotated labels for each sample, but the other datasets contain fewer. To handle the varying labels of the training datasets, we introduce a vector M(c) (\(M \in R^{1\times 10}\), where \(c=0\) indexes the background of the CT image and \(c \in [1,9]\) indexes the 9 HaN OARs) to mark missing annotations: M(c) is 1 if class c is annotated in the training sample and 0 otherwise. The output of the last convolution layer is the voxel-wise prediction probability, denoted by P, of dimension \(N_c\times H\times W\times D\), where \(N_c\) is the number of channels, corresponding to the OAR classes, and \(H\times W\times D\) is the size of the CT image.

The marginal probability fuses the probabilities of the unlabeled organs and the background so that the model can still learn from samples with missing labels; it is formulated in Eq. (1).

$$\begin{aligned} P_{M} = \sum _{c=0}^{9}P(c\mid M(c)=0) \end{aligned}$$
(1)

where \(P_M\) denotes the marginal probability and c the organ index; \(M(c)=0\) indicates that organ c is not annotated. The marginal probability and the probabilities of the annotated organs form a new vector, denoted by Q and formulated in Eq. (2).

$$\begin{aligned} Q = \left[ P_{M} \quad P(c\mid M(c)=1)\right] \end{aligned}$$
(2)
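For a single sample, Eqs. (1) and (2) can be implemented directly from the softmax output P and the annotation mask M; the sketch below assumes M(0) = 0 so that the background channel is folded into the marginal probability.

```python
import torch

def marginal_probability(P, M):
    """Build Q of Eq. (2) from P (shape (10, H, W, D)) and M (shape (10,)).

    All channels with M(c) = 0 (background and unlabeled organs) are summed
    into one marginal channel P_M (Eq. (1)); annotated channels are kept.
    The corresponding one-hot ground truth must be re-ordered the same way.
    """
    M = M.bool()
    p_m = P[~M].sum(dim=0, keepdim=True)   # Eq. (1): marginal probability
    return torch.cat([p_m, P[M]], dim=0)   # Eq. (2): [P_M, P(c | M(c) = 1)]
```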

The binary masks of the annotated organs in the training dataset are denoted by Y, and their one-hot encoding is given in Eq. (3).

$$\begin{aligned} Z = \mathrm{onehot}(Y) \end{aligned}$$
(3)

The Dice loss and the focal loss are plugged into the marginal loss to mitigate the imbalanced segmentation accuracy caused by large volume differences, as formulated in Eqs. (4)–(9).

$$\begin{aligned}&\mathrm{TP}_m(t) = \sum _{n=1}^{N}Z_{n}(t)Q_{n}(t) \end{aligned}$$
(4)
$$\begin{aligned}&\mathrm{FN}_m(t) = \sum _{n=1}^{N}Z_{n}(t)(1-Q_{n}(t)) \end{aligned}$$
(5)
$$\begin{aligned}&\mathrm{FP}_m(t) = \sum _{n=1}^{N}Q_{n}(t)(1-Z_{n}(t)) \end{aligned}$$
(6)
$$\begin{aligned}&L_{\mathrm{mDice}} = T - 2 \sum _{t=0}^{T}\frac{\mathrm{TP}_m(t)}{\mathrm{TP}_m(t)+\alpha \mathrm{FN}_m(t)+\beta \mathrm{FP}_m(t)} \end{aligned}$$
(7)
$$\begin{aligned}&L_{\mathrm{mFocal}} = -\lambda \frac{1}{N}\sum _{t=0}^{T}\sum _{n=1}^{N}Z_{n}(t)(1-Q_{n}(t))^2 \log (Q_{n}(t)) \end{aligned}$$
(8)
$$\begin{aligned}&L_{m} = L_{\mathrm{mDice}} + L_{\mathrm{mFocal}} \end{aligned}$$
(9)

where \(\mathrm{TP}_m(t)\), \(\mathrm{FP}_m(t)\), and \(\mathrm{FN}_m(t)\) denote the true positives, false positives, and false negatives of the marginal probability for organ t, respectively. \(Q_{n}(t)\) denotes the marginal probability of voxel n for organ t, and \(Z_{n}(t)\) denotes the one-hot encoding of voxel n for organ t. T and N denote the total numbers of annotated organs and voxels for one sample, respectively, and C denotes the total number of OARs, which is 9 in our task. \(L_{\mathrm{mDice}}\), \(L_{\mathrm{mFocal}}\), and \(L_m\) denote the Dice loss, focal loss, and marginal loss, respectively, where \(\lambda \) trades off the Dice loss against the focal loss, and \(\alpha \) and \(\beta \) weight the false negatives and false positives. For the best performance, \(\lambda \) is set to 0.2, and \(\alpha \) and \(\beta \) are set to 0.5.
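A sketch of Eqs. (4)–(9) for a single sample follows, with Q and the one-hot ground truth Z aligned channel-wise as in Eq. (2); the small epsilon terms for numerical stability are our addition.

```python
import torch

def marginal_loss(Q, Z, alpha=0.5, beta=0.5, lam=0.2, eps=1e-5):
    """Marginal Dice + focal loss of Eqs. (4)-(9).

    Q, Z: tensors of shape (T + 1, H, W, D), where T is the number of
    annotated organs (channel 0 is the marginal channel).
    """
    dims = (1, 2, 3)
    tp = (Z * Q).sum(dim=dims)              # Eq. (4)
    fn = (Z * (1.0 - Q)).sum(dim=dims)      # Eq. (5)
    fp = (Q * (1.0 - Z)).sum(dim=dims)      # Eq. (6)
    T = Q.shape[0] - 1                      # number of annotated organs
    dice = T - 2.0 * (tp / (tp + alpha * fn + beta * fp + eps)).sum()   # Eq. (7)
    n_voxels = Q[0].numel()                 # N
    focal = -(lam / n_voxels) * (Z * (1.0 - Q) ** 2
                                 * torch.log(Q + eps)).sum()            # Eq. (8)
    return dice + focal                     # Eq. (9)
```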

The exclusion vector exploits the principle that each voxel can belong to only one organ, i.e., the mutual exclusion of voxels between OARs. It is computed by negating the one-hot vector of the binary mask, as formulated in Eq. (10).

$$\begin{aligned} E(c)={\left\{ \begin{array}{ll}1-Z(0) &{} \mathrm{if}\, M(c)=0 \\ 1-Z(c) &{} \mathrm{otherwise} \\ \end{array}\right. } \end{aligned}$$
(10)

The exclusion vector is denoted by E(c) and has dimension \(N_c \times H \times W \times D\). In this paper, the Dice loss is plugged into the exclusion loss, denoted by \(L_{\mathrm{eDice}}\), where \(P_{1n}(c)\) and \(P_{0n}(c)\) denote the probabilities that voxel n is or is not predicted to be organ c, respectively, and \(E_{1n}(c)\) and \(E_{0n}(c)\) indicate whether the exclusion vector at voxel n for organ c is 1 or 0, respectively. \(\alpha \) and \(\beta \) are set to 0.5 for the best performance. The exclusion loss is formulated in Eq. (11).

$$\begin{aligned} L_{\mathrm{eDice}} = \sum _{c=0}^{C} \frac{\sum _{n=1}^{N}P_{1n}(c)E_{1n}(c)}{\sum _{n=1}^{N}P_{1n}(c)E_{1n}(c) + \alpha \sum _{n=1}^{N}P_{0n}(c)E_{1n}(c) + \beta \sum _{n=1}^{N}P_{1n}(c)E_{0n}(c) } \end{aligned}$$
(11)

Extensive experiments showed that the model achieves the best performance when the weight of the exclusion loss is 2. The final loss function, denoted by \(L_{\mathrm{loss}}\), is formulated in Eq. (12).

$$\begin{aligned} L_{\mathrm{loss}} = L_{m} + 2 \times L_{\mathrm{eDice}} \end{aligned}$$
(12)
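Eqs. (10)–(12) can be sketched in the same style; the epsilon term is again our addition, and Z_q denotes the one-hot ground truth re-ordered to match Q.

```python
import torch

def exclusion_dice_loss(P, Z, M, alpha=0.5, beta=0.5, eps=1e-5):
    """Exclusion Dice loss of Eqs. (10)-(11).

    P, Z: tensors of shape (10, H, W, D); M: 0/1 annotation mask of shape (10,).
    """
    loss = 0.0
    for c in range(P.shape[0]):
        # Eq. (10): excluded region for organ c.
        E1 = (1.0 - Z[0]) if M[c] == 0 else (1.0 - Z[c])
        E0 = 1.0 - E1
        p1, p0 = P[c], 1.0 - P[c]
        overlap = (p1 * E1).sum()           # prediction falling inside the excluded region
        loss = loss + overlap / (overlap + alpha * (p0 * E1).sum()
                                 + beta * (p1 * E0).sum() + eps)        # Eq. (11)
    return loss

def total_loss(P, Z, M, Q, Z_q):
    """Eq. (12): the exclusion loss weight of 2 is taken from the text."""
    return marginal_loss(Q, Z_q) + 2.0 * exclusion_dice_loss(P, Z, M)
```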
Fig. 4 Comparison of model performance on the original challenge dataset. (a) and (b) show the DSC and 95HD of the models, respectively; AC, TA, ET, and OUR denote the results of Antong Chen et al. [25], Thomas Albrecht et al. [26], Tappeiner et al. [13], and our model, respectively. ET does not provide results for the left and right submandibular glands. The error bars of ET for the brainstem and the left parotid gland are cropped for visualization; their standard deviations are 14.3 and 33.3, respectively

Results

Implementation details and evaluation metrics

Experiments were run on a platform with an NVIDIA RTX 2080Ti GPU and an Intel i7-10700 CPU, and the model was implemented in PyTorch. NVIDIA's Apex mixed-precision library accelerated training and saved hardware resources. The loss function was optimized with the RMSprop algorithm, with a learning rate of 0.001, 200 epochs, and a batch size of 1, the latter constrained by the per-sample vector M in the loss function. The Dice similarity coefficient (DSC), 95% Hausdorff distance (95HD) [19], and inference time were used to evaluate performance.
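The paper does not detail how the metrics were computed; the following is a common NumPy/SciPy formulation of the DSC and of a symmetric 95% Hausdorff distance over surface voxels, assuming the array axes are ordered (z, y, x) with the spacing from the Data section.

```python
import numpy as np
from scipy import ndimage

def dsc(pred, gt):
    """Dice similarity coefficient of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def hd95(pred, gt, spacing=(3.0, 1.2, 1.2)):
    """95% Hausdorff distance (mm) between the surfaces of two binary masks."""
    def surface(mask):
        return np.logical_and(mask, np.logical_not(ndimage.binary_erosion(mask)))

    def surface_distances(a, b):
        # Distance from every surface voxel of a to the nearest surface voxel of b.
        dt = ndimage.distance_transform_edt(np.logical_not(surface(b)), sampling=spacing)
        return dt[surface(a)]

    d = np.concatenate([surface_distances(pred, gt), surface_distances(gt, pred)])
    return np.percentile(d, 95)
```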

Table 1 DSC comparisons with state-of-the-art methods

Experimental results and analysis

We compared the model's DSC with previous state-of-the-art methods, as shown in Table 1. With the same training dataset, the DSC of our model is 4.5% higher than that of AnatomyNet [9], which also uses a single down-sampling layer to avoid losing information on small organs. The RFB expands the receptive field and resolves the conflict between a large receptive field and small organs. Moreover, the marginal loss handles data without labels, and the exclusion loss improves performance by exploiting prior knowledge among voxels. Our model is also superior to the best results of MICCAI 2015 [19] (which only reports the average DSC for symmetrical organs). Compared with nnU-Net [24], it performs better on imbalanced organs, whereas nnU-Net performs poorly on small-sized organs. Compared with the state-of-the-art models, it is close to the performance of FocusNetv2 [15] for large-sized organs and slightly worse for small-sized organs, but FocusNetv2 is a larger model with more parameters, trained with additional private data. In addition, FocusNet and FocusNetv2 are not end-to-end models: their three sub-networks are trained separately and then combined to segment the OARs.

DSC is sensitive to the internal details of organs, whereas 95HD is sensitive to their boundaries. The experimental results in Table 2 show that our method does not surpass FocusNetv2 [15] in terms of 95HD but is much better than the other methods. Our model performs well on organ boundaries because \(L_\mathrm{eDice}\) employs the mutually exclusive information among voxels of different organs. In addition, we performed the Kruskal–Wallis test on the offsite test dataset: for DSC, the p-value and test statistic are 0.9998 and 2.0630, respectively; for 95HD, they are 0.5293 and 12.9636.

Table 2 95HD comparisons with state-of-the-art methods (mm)
Table 3 Comparison of parameters and inference time of different models
Fig. 5 Visualization results. (a) Cross-sectional view of the prediction; (b) cross-sectional view of the ground truth; (c) cross-sectional view of the overlap between prediction and ground truth; (d) 3D view of the overlap between prediction and ground truth

To evaluate the performance of the model more credibly, we trained it with the original training samples (0522c0001 to 0522c0328 of PDDCA) and tested it on 15 samples, comprising the 10 offsite and 5 onsite test samples. With the same dataset, we also compared our model with the method of Tappeiner et al. [13] and with participants who provided full experimental results for MICCAI 2015 [25, 26]; Figure 4 shows their DSC scores and 95HD.

We also compared the number of parameters and the average inference time on the same hardware platform. The results in Table 3 show that our model has 60% fewer parameters than FocusNetv2 and 63% less inference time, meaning it requires fewer hardware resources and less time. Compared to AnatomyNet, our model achieves higher accuracy with an inference time and parameter count of the same order of magnitude.

Visualization

Figure 5 shows the visualization of the segmentation results; the legend indicates the correspondence between colors and organs. In the cross-sectional view, the predicted contours match the ground truth well for large-sized organs such as the mandible, while there are slight differences in size and shape for small-sized organs. In the 3D view, only very small differences in volume and shape remain between the predicted mask and the ground truth.

Conclusion

In conclusion, our model delineates HaN OARs with a better balance of inference time and accuracy. SA blocks are introduced into the model to aggregate multi-scale context information and encourage the grouping of voxels from the same organ. The model employs down-sampling only once and introduces a receptive field block to balance segmentation accuracy between large-sized and small-sized organs. In addition, its loss function combines the marginal loss and the mutual exclusion loss, which allows training in a partially supervised mode and exploits the prior information among voxels. Compared with natural images, HaN CT images exhibit relatively fixed shapes and stable spatial structures; prior knowledge of the OARs, such as shape, symmetry, and similarity, should therefore be considered in future research.