1 Introduction

Cancers in the thoracic region are among the most common cancers worldwide [17], and a significant proportion of patients are diagnosed at late stages with lymph node (LN) metastasis. The treatment protocol is a sophisticated combination of surgical resection and chemotherapy and/or radiotherapy [5]. Assessment of the involved LNs [1, 21] and accurate labeling of their corresponding stations are essential for treatment selection and planning. For example, in radiation therapy, the delineation accuracy of the gross tumor volume (GTV) and the clinical target volume (CTV) are two of the most critical factors impacting patient outcome. For CTV delineation, areas containing metastatic LNs should be included to sufficiently cover the sub-clinical disease regions [2]. One strategy to outline the sub-clinical disease region is to include the lymph node station (LNS) that contains the metastasized LNs [14, 19]. Thoracic LNSs are determined according to the text definitions of the International Association for the Study of Lung Cancer (IASLC) [15]. In the current clinical workflow, LNS delineation is predominantly a manual process performed on computed tomography (CT) images. Visual assessment and manual delineation are challenging and time-consuming even for experienced physicians, since converting the text definitions of the IASLC into precise 3D voxel-wise annotations can be error-prone, leading to large intra- and inter-user variability [2].

Fig. 1.

An illustration of LNS and key referencing organs. The top row illustrates the auto-searched top-6 key referencing organs; the bottom row depicts the 12 LNSs.

Deep convolutional neural networks (CNNs) have made remarkable progress in segmenting organs and tumors in medical imaging [4, 7,8,9, 18, 20]. However, only a handful of non-deep-learning studies have tackled automated LNS segmentation [3, 11, 13, 16]. An LNS atlas was established using deformable registration [3]. Predefined margins from manually selected organs, such as the aorta, trachea, and vessels, were applied to infer LNSs [11], which could not accurately adapt to individual subjects. Other methods [13, 16] built fuzzy models to directly parse the LNSs or to learn the relative positions between LNSs and several referencing organs. Average location errors ranging from 6.9 mm to 34.2 mm were reported on 22 test cases in [13], while an average Dice score (DSC) of \(66.0\%\) for 10 LNSs in 5 patients was observed in [16].

In this work, we propose DeepStationing – an anatomical context encoded deep LNS parsing framework with key organ auto-search. We first segment a comprehensive set of 22 chest organs related to the description of LNSs in the IASLC guideline. Inspired by [4], the 22 organs are stratified into anchor and non-anchor categories, and the predictions of the former category are exploited to guide and boost the segmentation performance of the latter. Next, the CT image and the referencing organ predictions are combined as different input channels to the LNS parsing module. The 22 referencing organs are identified by human experts; however, related to but different from the human process, a CNN may require a particular set of referencing organs (key organs) to achieve optimal performance. Therefore, we automatically search for the key organs by applying a channel-weighting to the input organ prediction channels based on differentiable neural search [10]. The auto-searched final top-6 key organs, i.e., esophagus, aortic arch, ascending aorta, heart, spine and sternum (shown in Fig. 1), enable our DeepStationing method to achieve high LNS parsing accuracy. We adopt 3D nnU-Net [6] as our segmentation and parsing backbone. Extensive 4-fold cross-validation is conducted on a dataset of 98 CT images, each annotated with 12 LNS and 22 organ labels, the first of its kind to date. Experimental results demonstrate that a deep model encoded with the spatial context of auto-searched key organs significantly improves the LNS parsing performance, resulting in an average Dice score (DSC) of \(81.1\%\,\pm \,6.1\%\), which is \(5.0\%\) and \(19.2\%\) higher than the pure CT-based deep model and the most recent relevant work [11] (from our re-implementations), respectively.

2 Method

Figure 2 depicts the overview of our DeepStationing framework, consisting of two major modularized components: (1) stratified chest organ segmentation; (2) context encoded LNS parsing with key organ auto-search.

Fig. 2.

Overall workflow of our DeepStationing, which consists of stratified chest organ segmentation and anatomical context encoded LNS parsing with key organ auto-search.

2.1 Stratified Chest Organ Segmentation

To provide the spatial context for LNS parsing, we first segment a comprehensive set of 22 chest organs related to the description of LNSs. Simultaneously segmenting a large number of organs increases the optimization difficulty, leading to sub-optimal performance. Motivated by [4], we stratify the 22 chest organs into anchor and non-anchor categories. Anchor organs have high contrast; hence, they are relatively easy to segment directly and robustly using deep appearance features. Anchor organs are segmented first, and their results serve as ideal candidates to support the segmentation of the more difficult non-anchor organs. We use two CNN branches to stratify the anchor and non-anchor organ segmentation: with the predicted anchor organs as additional input, the non-anchor organs are segmented. Assuming N data instances, we denote the training data as \(\mathbb {S}=\left\{ X_n, Y_n^{\mathrm {A}}, Y_n^{\mathrm {\lnot A}}, Y_n^{\mathrm {L}} \right\} _{n=1}^{N}\), where \(X_n\), \(Y_n^{\mathrm {A}}\), \(Y_n^{\mathrm {\lnot A}}\) and \(Y_n^{\mathrm {L}}\) denote the input CT and the ground-truth masks for the anchor organs, non-anchor organs and LNSs, respectively. Assuming there are \(C_{\mathrm {A}}\) and \(C_{\mathrm {\lnot A}}\) classes for the anchor and non-anchor organs, and dropping n for clarity, our organ segmentation module generates the anchor and non-anchor organ predictions at every voxel location j and every output class c:

$$\begin{aligned} \hat{Y}^{\mathrm {A}}_c(j) = p^{\mathrm {A}}\left( Y^{\mathrm {A}}(j) = c\, |\, X ; \mathbf {W}^{\mathrm {A}}\right) \mathrm {,}&\quad \hat{\mathbf {Y}}^{\mathrm {A}}=\left[ \hat{Y}^{\mathrm {A}}_1\ldots \hat{Y}^{\mathrm {A}}_{C_{\mathrm {A}}} \right] \mathrm {,} \end{aligned}$$
(1)
$$\begin{aligned} \hat{Y}^{\mathrm {\lnot A}}_c(j) = p^{\mathrm {\lnot A}}\left( Y^{\mathrm {\lnot A}}(j) = c\, |\, X, \hat{\mathbf {Y}}^{\mathrm {A}}; \mathbf {W}^{\mathrm {\lnot A}}\right) \mathrm {,}&\quad \hat{\mathbf {Y}}^{\mathrm {\lnot A}}=\left[ \hat{Y}^{\mathrm {\lnot A}}_1\ldots \hat{Y}^{\mathrm {\lnot A}}_{C_{\mathrm {\lnot A}}} \right] \mathrm {,} \end{aligned}$$
(2)

where \(p^{(*)}(.)\) denotes the CNN functions and \(\hat{Y}^{(*)}_c\) the output segmentation maps. We combine the anchor and non-anchor organ predictions into an overall prediction map \(\hat{\mathbf {Y}}^{\mathfrak {A}}=\hat{\mathbf {Y}}^{\mathrm {A}} \cup \hat{\mathbf {Y}}^{\mathrm {\lnot A}}\). Predictions are vector-valued 3D masks, as they provide a pseudo-probability for every class. \(\mathbf {W}^{(*)}\) represents the corresponding CNN parameters.
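To make the stratification concrete, the following is a minimal PyTorch sketch of the inference pass in Eqs. (1)–(2). The single-convolution networks, channel counts, and tensor sizes are illustrative placeholders for the two nnU-Net branches, not the actual implementation.

```python
# Sketch of the stratified organ segmentation pass (Eqs. 1-2).
import torch
import torch.nn as nn

C_A, C_NA = 11, 11  # hypothetical anchor / non-anchor class counts

anchor_net = nn.Conv3d(1, C_A, kernel_size=3, padding=1)              # stand-in for the anchor branch
nonanchor_net = nn.Conv3d(1 + C_A, C_NA, kernel_size=3, padding=1)    # stand-in for the non-anchor branch

def stratified_forward(ct):                              # ct: (B, 1, D, H, W)
    y_a = torch.softmax(anchor_net(ct), dim=1)           # anchor pseudo-probabilities, Eq. (1)
    x_na = torch.cat([ct, y_a], dim=1)                   # CT + anchor predictions as extra channels
    y_na = torch.softmax(nonanchor_net(x_na), dim=1)     # non-anchor predictions, Eq. (2)
    return torch.cat([y_a, y_na], dim=1)                 # overall organ prediction map

organ_probs = stratified_forward(torch.randn(1, 1, 32, 64, 64))
```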

2.2 Anatomical Context Encoded LNS Parsing

Segmenting LNSs using only the CT appearance can be error-prone, since LNS boundaries rely heavily on the spatial context of adjacent anatomical structures. Emulating the clinical practice of the IASLC guidelines, we incorporate the referencing organs into the training process of LNS parsing. Given \(C_{\mathrm {L}}\) classes of LNSs, as illustrated in Fig. 2, we combine the above organ predictions with the CT image to create a multi-channel input \(\left[ X, \,\, \hat{\mathbf {Y}}^{\mathfrak {A}} \right] \):

$$\begin{aligned} \hat{Y}^{\mathrm {L}}_c(j) = p^{\mathrm {L}}\left( Y^{\mathrm {L}}(j) = c \, | \, X, \hat{\mathbf {Y}}^{\mathfrak {A}}; \mathbf {W}^{\mathrm {L}}\right) \mathrm {,} \quad \hat{\mathbf {Y}}^{\mathrm {L}} = \left[ \hat{Y}^{\mathrm {L}}_1\ldots \hat{Y}^{\mathrm {L}}_{C_{\mathrm {L}}} \right] \mathrm {.} \end{aligned}$$
(3)

Thereupon, the LNS parsing module leverages both the CT appearance and the predicted anatomical structures, implicitly encoding the spatial distributions of the referencing organs during training. Similar to Eq. (1), the LNS prediction in its vector-valued form is \(\hat{\mathbf {Y}}^{\mathrm {L}}\).
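Below is a small sketch of how the multi-channel input of Eq. (3) can be assembled, assuming the organ prediction map from Sect. 2.1 is available; the network, channel counts, and sizes are again illustrative placeholders rather than the authors' implementation.

```python
# Sketch of the context-encoded LNS parsing input of Eq. (3): the CT volume and the
# organ pseudo-probability channels are stacked channel-wise before entering the network.
import torch
import torch.nn as nn

C_ORG, C_L = 22, 13   # hypothetical organ-channel and LNS class counts (incl. background)
lns_net = nn.Conv3d(1 + C_ORG, C_L, kernel_size=3, padding=1)   # stand-in for the LNS parsing nnU-Net

ct = torch.randn(1, 1, 32, 64, 64)               # X
organ_probs = torch.rand(1, C_ORG, 32, 64, 64)   # organ prediction map from Sect. 2.1
lns_probs = torch.softmax(lns_net(torch.cat([ct, organ_probs], dim=1)), dim=1)   # Eq. (3)
```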

Key Organ Auto-Search. The 22 referencing organs were initially selected according to the IASLC guideline. Nevertheless, for deep-learning-based LNS model training, these manually selected organs may not lead to optimal performance. Considering the variations in organ location and size distributions, and the differences in automated organ segmentation accuracy, we hypothesize that the deep LNS parsing model would benefit from an automated referencing organ selection process tailored to this purpose. Hence, we use differentiable neural search [10] to search for the key organs by applying a channel-weighting strategy to the input organ masks. We make the search space continuous by relaxing the selection of the referencing organs to a Softmax over the channel weights of the one-hot organ predictions \(\hat{\mathbf {Y}}^{\mathfrak {A}}\). For the \(C_{\mathfrak {A}} = C_{\mathrm {A}} + C_{\mathrm {\lnot A}}\) organ prediction channels, we define one learnable logit per channel, denoted as \(\alpha _c, \forall c \in \left[ 1\cdots C_{\mathfrak {A}}\right] \). The channel weight \(\phi _c\) for a referencing organ is defined as:

$$\begin{aligned} \phi _c = \dfrac{\text {exp}\left( \alpha _{c} \right) }{\sum _{m=1}^{C_{\mathfrak {A}}}\text {exp}\left( \alpha _{m} \right) }&\mathrm {,} \quad \varPhi = \left[ \phi _1 \cdots \phi _{C_{\mathfrak {A}}} \right] \mathrm {,} \end{aligned}$$
(4)
$$\begin{aligned} F(\hat{Y}^{\mathfrak {A}}_c, \phi _c) = \phi _c \cdot \hat{Y}^{\mathfrak {A}}_c&\mathrm {,} \quad F (\hat{\mathbf {Y}}^{\mathfrak {A}}, \varPhi ) = \left[ F(\hat{Y}^{\mathfrak {A}}_1, \phi _1) \cdots F(\hat{Y}^{\mathfrak {A}}_{C_{\mathfrak {A}}}, \phi _{C_{\mathfrak {A}}}) \right] \mathrm {,} \end{aligned}$$
(5)

where \(\varPhi \) denotes the set of channel weights and \(F(\hat{Y}^{\mathfrak {A}}_c, \phi _c)\) denotes the channel-wise multiplication between the scalar \(\phi _c\) and the organ prediction \(\hat{Y}^{\mathfrak {A}}_c\). The input to the LNS parsing model then becomes \(\left[ X, \,\, F (\hat{\mathbf {Y}}^{\mathfrak {A}}, \varPhi ) \right] \). As the result of the key organ auto-search, we select the organs with the top-n weights as the n key organs. In this paper, we heuristically set \(n=6\) based on the experimental results. Finally, we train the LNS parsing model using the combination of the original CT images and the segmentation predictions of the auto-selected top-6 key organs.
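The channel-weighting of Eqs. (4)–(5) can be sketched in PyTorch as below; one learnable logit per organ channel is softmax-normalized and scales the corresponding prediction channel, and the top-k weights determine the searched key organs. Class names and sizes are our own illustrative choices, not the authors' exact implementation.

```python
# Sketch of the differentiable channel-weighting used for key organ auto-search.
import torch
import torch.nn as nn

class OrganChannelSearch(nn.Module):
    def __init__(self, num_organ_channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_organ_channels))   # logits alpha_c, Eq. (4)

    def forward(self, organ_probs):                     # organ_probs: (B, C, D, H, W)
        phi = torch.softmax(self.alpha, dim=0)          # channel weights Phi
        return organ_probs * phi.view(1, -1, 1, 1, 1)   # channel-wise scaling, Eq. (5)

    def top_k_organs(self, k: int = 6):
        # After the search: keep the organs with the k largest weights.
        return torch.topk(torch.softmax(self.alpha, dim=0), k).indices

search = OrganChannelSearch(num_organ_channels=22)
weighted = search(torch.rand(1, 22, 32, 64, 64))
key_organ_ids = search.top_k_organs(6)
```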

3 Experimental Results

Dataset. We collected 98 contrast-enhanced venous-phase CT images of esophageal cancer patients who underwent surgery and/or radiotherapy. A board-certified radiation oncologist with 15 years of experience annotated each patient with 3D masks of the 12 LNSs, the involved LNs (if any), and the 22 referencing organs related to LNSs according to the IASLC guideline. The 12 annotated LN stations are: S1 (left + right), S2 (left + right), S3 (anterior + posterior), S4 (left + right), S5, S6, S7, and S8. The average CT image size is \(512 \times 512 \times 80\) voxels with an average resolution of \(0.7 \times 0.7 \times 5.0\) mm. Extensive four-fold cross-validation (CV), separated at the patient level, was conducted. We report the segmentation performance using the DSC in percentage, and the Hausdorff distance (HD) and average surface distance (ASD) in mm.
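For reference, a minimal NumPy sketch of the per-class Dice computation used in the evaluation is given below; HD and ASD would additionally require surface extraction and distance transforms (e.g., via scipy.ndimage), which are omitted here. The label layout is assumed for illustration.

```python
# Sketch of the per-class Dice score; labels 1..12 are assumed to be the 12 LNSs.
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, label: int) -> float:
    p, g = (pred == label), (gt == label)
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom > 0 else float("nan")

pred = np.random.randint(0, 13, size=(80, 512, 512))   # toy prediction volume
gt = np.random.randint(0, 13, size=(80, 512, 512))     # toy ground-truth volume
mean_dsc = np.nanmean([dice_score(pred, gt, c) for c in range(1, 13)])
```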

Table 1. Mean DSCs, HDs, and ASDs, and their standard deviations of LNS parsing performance using: (1) only CT appearance; (2) CT\(+\)all 22 referencing organ ground-truth masks; (3) CT\(+\)all 22 referencing organ predicted masks; (4) CT\(+\)auto-searched 6 referencing organ predicted masks. The best performance scores are shown in bold.

Implementation Details. We adopt nnU-Net [6] with DSC+CE losses as the backbone for all experiments due to its high accuracy on many medical image segmentation tasks. nnU-Net automatically adapts its preprocessing and training strategies (e.g., training image patch size, resolution, and learning rate) to a given 3D medical imaging dataset. We use the default nnU-Net settings for model training, with 1000 training epochs in total. For the organ auto-search parameters \(\alpha _c\), we first fix \(\alpha _c\) for 200 epochs and then alternately update \(\alpha _c\) and the network weights for another 800 epochs. The remaining settings are the same as the default nnU-Net setup. We implemented DeepStationing in PyTorch and trained it on an NVIDIA Quadro RTX 8000. The average training and inference times are 2.5 GPU days and 3 min, respectively.
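The alternating schedule can be sketched as follows, reusing the OrganChannelSearch module sketched earlier. The text only states that \(\alpha _c\) is fixed for 200 epochs and then updated alternately with the network weights; the per-epoch alternation granularity, optimizer choices, and learning rates below are our assumptions, and the model, loss, and data loader are placeholders.

```python
# Sketch of the alternating alpha / network-weight training schedule.
import torch

def train(model, search, loader, loss_fn, total_epochs=1000, warmup=200):
    w_opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.99)  # network weights
    a_opt = torch.optim.Adam([search.alpha], lr=1e-3)                    # search logits alpha
    for epoch in range(total_epochs):
        # Freeze alpha during warm-up, then alternate epoch by epoch (assumed granularity).
        update_alpha = epoch >= warmup and (epoch - warmup) % 2 == 1
        opt = a_opt if update_alpha else w_opt
        for ct, organ_probs, lns_gt in loader:
            opt.zero_grad()
            pred = model(torch.cat([ct, search(organ_probs)], dim=1))
            loss = loss_fn(pred, lns_gt)
            loss.backward()
            opt.step()
```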

Quantitative Results. We first evaluate the performance of our stratified referencing organ segmentation. The average DSC, HD and ASD for the anchor and non-anchor organs are \(90.0\pm 4.3\%\), \(16.0\pm 18.0\) mm, \(1.2\pm 1.1\) mm, and \(82.1\pm 6.0\%\), \(19.4\pm 15.0\) mm, \(1.2\pm 1.4\) mm, respectively. We also train a model that segments all organs using a single nnU-Net. The average DSCs of the anchor, non-anchor, and all organs are \(86.4\pm 5.1\%\), \(72.7\pm 8.7\%\), and \(80.8\pm 7.06\%\), which are \(3.6\%\), \(9.4\%\), and \(5.7\%\) lower than the stratified version, respectively. The stratified organ segmentation demonstrates high accuracy, providing robust organ predictions for the subsequent LNS parsing model.

Fig. 3.

(a) Examples of LNS parsing results using different setups. For better comparison, red arrows are used to depict visual improvements. (b) The bottom charts demonstrate the performance using different numbers of searched referencing organs.

Table 1 outlines the quantitative comparisons of different deep LNS parsing setups. Columns 1 to 3 show the results using: (1) only CT images; (2) CT \(+\) all 22 ground-truth organ masks; and (3) CT \(+\) all 22 predicted organ masks. Using only CT images, LNS parsing exhibits the lowest performance, with an average DSC of \(76.1\%\) and HD of 22.6 mm. For example, distant false predictions are observed in the first image of the \(2^{nd}\) row of Fig. 3, where a false-positive S3 posterior (in pink) is predicted between S1 and S2. When adding the 22 ground-truth organ masks as spatial context, both DSC and HD show marked improvements: from \(76.1\%\) to \(79.8\%\) in DSC and from 22.6 mm to 11.8 mm in HD. This verifies the importance and effectiveness of the referencing organs in inferring LNS boundaries. However, when the predicted masks of the 22 organs are used (the real testing condition), the HD increases significantly from 11.8 mm to 28.9 mm compared to using the ground-truth organ masks. This shows the necessity of selecting key organs suited to the deep parsing model. Finally, using the top-6 auto-searched referencing organs, our DeepStationing model achieves the best performance, reaching 81.1 ± 6.1% DSC, 9.9 ± 3.5 mm HD and 0.9 ± 0.6 mm ASD. Qualitative examples illustrating these performance improvements are shown in Fig. 3.

We auto-search for the organs that are tailored to optimize the LNS parsing performance. Using an interval of 3, we train 7 additional LNS parsing models by including the top-3 up to the top-21 organs. The auto-searched ranking of the 22 organs is as follows: esophagus, aortic arch, ascending aorta, heart, spine, sternum, V.BCV (R+L), V.pulmonary, descending aorta, V.IJV (R+L), A.CCA (R+L), V.SVC, A.pulmonary, V.azygos, bronchus (R+L), lung (R+L), and trachea, where 'A' and 'V' denote artery and vein. The quantitative LNS parsing results when selecting the top-n organs are illustrated in the bottom charts of Fig. 3. As more organs are gradually included, the DSC first improves, then drops slightly once more than the top-6 organs are included. The performance then drops sharply beyond the top-9 organs and becomes steady once more than the top-15 organs are included. This demonstrates that the deep LNS parsing model does not need a complete set of referencing organs to capture the LNS boundaries. We choose the top-6 as our final key organs based on these experimental results. We notice that the trachea, lungs, and bronchi are, surprisingly, ranked in the bottom-5 of the auto-search, although previous works [11, 12] manually selected them for LNS parsing. A possible reason is that these organs are usually filled with air and have clear boundaries, whereas LNSs do not include air or air-filled organs; with the help of the other key organs, it is relatively straightforward for the LNS parsing CNN to distinguish them and reject false-positives located in those air-filled organs.

We further include 6 ablation studies and segment LNSs using: (1) 6 randomly selected organs; (2) the 6 organs with the best organ segmentation accuracy; (3) the anchor organs; (4) 6 organs recommended by senior oncologists; (5) the predictions of the 6 searched organs from the less accurate non-stratified organ segmentor; and (6) the ground-truth masks of the 6 searched organs. The 6 randomly selected organs are: V.BCV (L), V.pulmonary, V.IJV (R), heart, spine, and trachea. The 6 organs with the best segmentation accuracy are: lungs (R+L), descending aorta, heart, trachea, and spine. The 6 oncologist-recommended organs are: trachea, aortic arch, spine, lungs (R+L), and descending aorta. The DSCs for setups (1–6) are 77.2%, 78.2%, 78.6%, 79.0%, 80.2%, and 81.7%; the HDs are 19.3 mm, 11.8 mm, 12.4 mm, 11.0 mm, 10.1 mm, and 8.6 mm, respectively. In comparison to the LNS predictions using only CT images, the ablation studies demonstrate that using referencing organs is the key contributor to the performance gain, and that the selection and quality of the supporting organs are the main factors for the performance boost; e.g., our main results together with setups (5) and (6) show that better delineation of the searched organs yields superior LNS segmentation performance.

Comparison to Previous Work. We compare DeepStationing to the most relevant previous approach [11], which exploits heuristically pre-defined spatial margins for LNS inference. DeepStationing outperforms [11] by \(19.2\%\) in DSC, 30.2 mm in HD, and 5.2 mm in ASD. For ease of comparison, similar to [11], we also merge our LNSs into four LN zones, i.e., the supraclavicular (S1), superior (S2, S3, and S4), aortic (S5 and S6) and inferior (S7 and S8) zones, and calculate the accuracy of LN instances that are correctly located in the predicted zones. DeepStationing achieves an average accuracy of \(96.5\%\), an absolute \(13.3\%\) higher than [11] in LN instance counting accuracy. We additionally tested 2 backbone networks: 3D PHNN (a 3D U-Net with a lightweight decoding path) and 2D U-Net, which yield DSCs of 79.5% and 78.8%, respectively. The assumed reasons for these performance drops are the loss of boundary precision and of 3D information, respectively.
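As a rough illustration of the zone-level evaluation described above, the sketch below groups the 12 stations into the four zones and counts an involved LN as correct when its centroid falls inside a station of the ground-truth zone; all identifiers and the centroid-based criterion are hypothetical and not taken from the authors' code.

```python
# Sketch of merging stations into zones and computing LN instance counting accuracy.
STATION_TO_ZONE = {
    "S1L": "supraclavicular", "S1R": "supraclavicular",
    "S2L": "superior", "S2R": "superior", "S3A": "superior", "S3P": "superior",
    "S4L": "superior", "S4R": "superior",
    "S5": "aortic", "S6": "aortic",
    "S7": "inferior", "S8": "inferior",
}

def ln_counting_accuracy(ln_centroids, ln_zones, pred_station_at):
    """ln_centroids: list of (z, y, x); ln_zones: ground-truth zone per LN;
    pred_station_at: callable mapping a voxel coordinate to a predicted station name."""
    correct = sum(
        STATION_TO_ZONE.get(pred_station_at(c)) == z
        for c, z in zip(ln_centroids, ln_zones)
    )
    return correct / max(len(ln_centroids), 1)

# Toy usage: a single LN in the inferior zone, predicted as station S7.
acc = ln_counting_accuracy([(10, 200, 200)], ["inferior"], lambda c: "S7")
```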

4 Conclusion

In this paper, we propose DeepStationing, a novel framework that performs key organ auto-search based LNS parsing on contrast-enhanced CT images. Emulating clinical practice, we segment the referencing organs in the thoracic region and use the segmentation results to guide LNS parsing. Instead of employing the key organs directly suggested by oncologists, we search for the key organs automatically, formulated as a neural architecture search problem, to achieve optimal performance. Evaluated on the most comprehensive LNS dataset to date, DeepStationing outperforms the most relevant previous approach by a significant margin of 19.2% in DSC, and its behavior is coherent with clinical interpretation. This work is an important step towards reliable and automated LNS segmentation.