Introduction

Extranodal extension (ENE) occurs when metastatic tumor cells within a lymph node break through the nodal capsule into surrounding tissues [1]. It is characterized clinically by skin invasion, soft-tissue invasion with deep tethering to underlying muscle or adjacent structures, or clinical signs of nerve involvement [2]. The presence of ENE is associated with a higher rate of local recurrence and poorer survival in head and neck cancers [3]. The AJCC 8th edition introduced the use of ENE in the “N” category for metastases in cervical lymph nodes [3].

ENE can be reliably diagnosed only by postoperative pathological specimens [1, 3, 4]. The identification of pathologic ENE is an indication for adjuvant treatment intensification [1, 3, 5]. Patients being treated without surgery would benefit from concurrent chemoradiotherapy, and patients being treated with surgery may require adjuvant chemotherapy [5, 6]. The detection of ENE prior to treatment may be helpful in guiding subsequent therapy [1, 3, 5,6,7].

Contrast-enhanced CT is the imaging modality most commonly used to evaluate cervical lymph node status [5]. The CT findings suggesting ENE are increased size, central necrosis with contour irregularity, irregular borders with infiltration into adjacent fat and muscle planes, and/or gross invasion [4, 5]. Although CT is recommended for visualizing macroscopic ENE [4], its diagnostic performance is reported to be suboptimal, with an area under the curve (AUC) of the receiver-operating characteristic (ROC) plot ranging from 0.65 to 0.69 [1, 3]. In addition, high intra-observer variability has been reported [1].

Artificial intelligence based on deep learning has increasingly been applied in medical fields [8, 9], especially in diagnostic imaging. A deep learning system using multi-layered convolutional neural networks (CNNs) can automatically extract and analyze quantitative image features and create predictive models [10,11,12,13].

Kann et al. achieved an AUC of 0.91 in the diagnosis of ENE on CT images using the deep learning architecture DualNet [1]. Although the actual capability and size of the graphics processing unit (GPU) are not described in detail, DualNet, which is used in the Go-playing program AlphaGo [14, 15], can reasonably be assumed to require a large and expensive GPU. In our previous study, which used a relatively low-cost system consisting of the AlexNet neural network and the deep learning training system DIGITS on an 11 GB GPU machine (NVIDIA Corporation, Holmdel NJ, USA), a high AUC of 0.80 was achieved in the diagnosis of lymph node metastases in oral cancer patients on contrast-enhanced CT [16].

The purpose of this study was to verify the possibility of using a relatively low-cost deep learning system for diagnosing ENE of cervical lymph node metastases on contrast-enhanced CT in oral cancer patients by comparing its diagnostic performance with that of radiologists.

Materials and methods

This study was approved by the Ethics Committee of our University (No. 496) and was planned in accordance with the ethical standards of the Declaration of Helsinki.

Subjects

The subjects were selected from patients whose contrast-enhanced CT imaging data were stored in the image database of our hospital between 2007 and 2018. Fifty-one patients with oral squamous cell carcinoma who underwent neck dissection in our hospital and who were pathologically confirmed to have cervical lymph node metastasis were registered. They comprised 27 men and 24 women, with ages ranging from 28 to 94 years (median 64 years).

The contrast-enhanced CT examinations were performed with an Asterion TXT machine (Canon Medical Systems, Otawara, Japan). Patients received a 100-mL injection of iodinated contrast medium (Iopamiron 300; Bayer Yakuhin, Ltd., Osaka, Japan) containing 300 mg of iodine/mL at a rate of 20 mL/s. Axial scans from the skull base to the superior mediastinum were acquired parallel to the Frankfort plane, with a tube voltage of 120 kVp and a tube current-time product of 100 mAs. The other parameters were a slice thickness of 0.5 mm, a pitch of 0.3 mm, and a field of view of 20 cm.

The lymph nodes on CT images and the dissected specimens were carefully investigated to obtain one-to-one correspondence between them. A total of 143 metastatic lymph nodes were identified with one-to-one correspondence. These lymph nodes were fixed in 10% formalin and then embedded in paraffin. The specimens were sliced at a thickness of 2.5 µm and stained with hematoxylin and eosin. A center-sliced specimen was re-evaluated by an oral pathologist (YS) to determine the presence or absence of ENE. As a result, 33 metastatic lymph nodes showed ENE (Group 1), and 110 metastatic lymph nodes showed no evidence of ENE (Group 0).

Four to six consecutive axial CT images were selected for each lymph node: one slice was centered on the lymph node, and two or three slices were located above and below the lymph node center. In total, 178 images in Group 1 and 525 images in Group 0 were adopted. A radiologist cropped all images into arbitrarily sized squares including the histopathologically proven metastatic lymph nodes and surrounding tissues using the macro function of Adobe Photoshop v. 13.0 (Adobe Systems Co. Ltd., San Jose CA, USA). The squares ranged from 4 to 32 mm (median 10.8 mm), with 1 mm represented by 51 pixels (Fig. 1).

Fig. 1

Cropping CT images. CT images including lymph nodes and surrounding tissues were cropped into arbitrarily sized squares. The squares ranged from 4–32 mm (median 10.8 mm) with a 1-mm size represented by 51 pixels

Preparation of training and testing imaging datasets

All imaging patches were automatically divided into two datasets using an automated selection method (Fig. 2), with 80% assigned to a training dataset and 20% to a testing dataset. Although this method did not take into account whether images were obtained from the same lymph node or from the same patient when assigning them to the two datasets, it allowed the training and testing processes to be repeated with randomly selected slices.

Fig. 2

Automated selection method. All imaging patches were automatically divided into two datasets using the automated selection method, assigning 80% as training dataset and 20% as testing dataset. Gr group
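The automated selection method can be pictured with a minimal Python sketch. It is an illustration only, not the tool actually used in the study: the function name `split_patches`, the patch identifiers, the seeds, and the choice to split each group separately are all assumptions. Only the 80%/20% ratio, the patch counts (178 and 525), and the five repetitions come from the text.

```python
import random


def split_patches(patch_ids, train_fraction=0.8, seed=0):
    """Shuffle patch identifiers and split them into training and testing lists."""
    rng = random.Random(seed)
    shuffled = list(patch_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers for the 178 Group 1 and 525 Group 0 patches.
group1 = [f"gr1_{i:03d}" for i in range(178)]
group0 = [f"gr0_{i:03d}" for i in range(525)]

# Five repetitions, one per learning model. Splitting each group separately
# is an assumption of this sketch; patches are shuffled individually, so
# slices from one lymph node can land in both datasets, as noted in the text.
splits = []
for repetition in range(5):
    train1, test1 = split_patches(group1, seed=repetition)
    train0, test0 = split_patches(group0, seed=100 + repetition)
    splits.append({"train": train1 + train0, "test": test1 + test0})
```

A split that keeps all slices from one lymph node or one patient in the same dataset would require grouping the identifiers by node or patient before shuffling.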

Deep learning procedure

A deep learning system was built on a workstation with GeForce GTX 1080 Ti graphics (NVIDIA; 11 GB of GPU memory), 128 GB of memory, and the open-source operating system Ubuntu v. 16.04.2. The prepared training datasets were imported into the deep learning training system DIGITS v. 5.0 (NVIDIA; https://developer.nvidia.com/digits). The learning process was performed for 300 epochs using the CNN “AlexNet”, which consists of five convolutional layers and three fully connected layers. The open-source Convolutional Architecture for Fast Feature Embedding (Caffe) was used as the deep learning framework.
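To make the "five convolutional layers and three fully connected layers" concrete, the following Python sketch traces feature-map sizes through an AlexNet-style network. The kernel sizes, strides, padding, channel counts, and the 227 × 227 input are those of the widely used Caffe reference AlexNet and are assumed here; the text does not state the exact configuration or input size used in DIGITS. The two-class output (ENE present or absent) is likewise an assumption based on the two groups.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


# (name, kernel, stride, pad, output channels); pooling keeps the channel count.
# Values follow the Caffe reference AlexNet, which is assumed for this sketch.
LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, None),
]


def trace_shapes(input_size=227, channels=3):
    """Return (name, channels, height, width) after every layer."""
    shapes = []
    size = input_size
    for name, kernel, stride, pad, out_channels in LAYERS:
        size = out_size(size, kernel, stride, pad)
        if out_channels is not None:
            channels = out_channels
        shapes.append((name, channels, size, size))
    return shapes


shapes = trace_shapes()
# Number of values fed into the first fully connected layer.
flattened = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
# Three fully connected layers; the last has one output per class (assumed 2).
fc_sizes = [4096, 4096, 2]

for name, c, h, w in shapes:
    print(f"{name}: {c} x {h} x {w}")
print("fully connected:", flattened, "->", " -> ".join(map(str, fc_sizes)))
```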

The automated selection method was repeated five times, resulting in five learning models. Each testing dataset was applied to the corresponding learning model, and the five resulting performances were averaged to estimate the diagnostic performance in terms of accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Receiver-operating characteristic (ROC) curves were generated, and the areas under the curves (AUCs) were determined.
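The averaging step can be sketched as follows in Python. The five measures use their standard definitions from a 2 × 2 confusion matrix; the counts below are hypothetical placeholders chosen only to make the example run, and they are not the study's results.

```python
from statistics import mean


def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic measures from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical (tp, fn, tn, fp) counts for five learning models.
# These numbers are illustrative only and are NOT taken from the study.
counts = [
    (24, 12, 95, 10),
    (25, 11, 93, 12),
    (23, 13, 96, 9),
    (26, 10, 92, 13),
    (24, 12, 94, 11),
]

per_model = [diagnostic_metrics(*c) for c in counts]

# Estimated performance: mean of each measure over the five models.
averaged = {name: mean(m[name] for m in per_model) for name in per_model[0]}

for name, value in averaged.items():
    print(f"{name}: {value:.3f}")
```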

Diagnostic performance of radiologists

The diagnostic performance of radiologists was determined using three characteristic CT features suggesting ENE: a minor axis > 11 mm, central necrosis, and irregular borders. For each metastatic lymph node, a randomly selected center slice or adjacent slice image was used to determine the radiologists’ performance. Sixty-six slice images each were selected from Group 1 and Group 0. The minor axis was defined as the maximum diameter perpendicular to the long axis of a lymph node on CT images (Fig. 3a). One radiologist (YA) with more than 20 years of experience measured the minor axis twice and averaged the values. The 11-mm threshold was the value at which the largest AUC was obtained in a preliminary ROC analysis (data not shown). For central necrosis and irregular borders, three radiologists with more than 10 years of experience evaluated whether these features were present or absent (Fig. 3b–e). After practice on several samples with central necrosis or irregular borders, the actual interpretations were performed on a personal monitor (RadiForce G20; Eizo Nanao Corp., Ishikawa, Japan) with a size of 20.1 inches and a resolution of 1600 × 1200 pixels. The observers rated the probability of the presence of central necrosis or irregular borders on a 4-point scale: 1, absent; 2, probably absent; 3, probably present; and 4, present. The evaluation was deemed negative when the score was 1 or 2 and positive when the score was 3 or 4. The accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and AUC were calculated.

Fig. 3

Evaluation of CT features by radiologists. A radiologist measured twice and averaged the minor axis, which was defined as the maximum diameter perpendicular to the long axis of the lymph node on CT (a). Three radiologists evaluated whether central necrosis was present (b) or absent (c). They also evaluated whether irregular borders were present (d) or absent (e)
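The handling of the 4-point ratings can be illustrated with a short Python sketch. The dichotomization rule (1 and 2 negative, 3 and 4 positive) is taken from the text. The AUC function uses the rank-based (Mann-Whitney) formulation, which is one common way to obtain an AUC from ordinal ratings; whether the study computed its AUCs this way is not stated, so it is an assumption. The ratings below are hypothetical.

```python
def is_positive(score):
    """Scores 1 and 2 are deemed negative; scores 3 and 4 are deemed positive."""
    return score >= 3


def auc_from_scores(pos_scores, neg_scores):
    """Rank-based AUC: chance that an ENE image is rated higher than a
    non-ENE image, with ties counted as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical ratings for one CT feature (not data from the study).
ene_ratings = [4, 3, 2, 4, 1, 3]      # images of nodes with ENE
no_ene_ratings = [1, 2, 3, 1, 2, 4]   # images of nodes without ENE

sensitivity = sum(is_positive(s) for s in ene_ratings) / len(ene_ratings)
specificity = sum(not is_positive(s) for s in no_ene_ratings) / len(no_ene_ratings)
auc = auc_from_scores(ene_ratings, no_ene_ratings)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} AUC={auc:.3f}")
```

The same `auc_from_scores` function can be applied to a continuous measurement such as the minor axis in millimeters, since it depends only on the ordering of the values.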

Statistical analysis

AUC values were compared using the chi-squared test. Values of p < 0.05 were considered statistically significant.

Results

Time required for deep learning process

The time required to import the training dataset into DIGITS was 6 s. The time required to perform the 300-epoch learning process using AlexNet and create the learning model was 9 min. The time required to apply a testing dataset to the learning model and judge the presence of ENE was 11 s.

Diagnostic performance of deep learning system and radiologists

The diagnostic performances of the five models are shown in Table 1. The accuracy of each model was > 80%. The estimated accuracy of 84.0% and specificity of 89.7% were satisfactory, whereas the sensitivity of 66.9% was still not high.

Table 1 Diagnostic performance of the deep learning system using automatic selection

In the evaluation of central necrosis and irregular borders by three radiologists, the inter-observer agreements in kappa values were 0.798 (substantial agreement) and 0.528 (moderate agreement), respectively. The AUCs based on minor axis, central necrosis, and irregular borders were 0.553, 0.515, and 0.629, respectively. The AUC of the deep learning system was significantly different from that of radiologists when they used the minor axis (p = 0.0039) and central necrosis (p = 0.0011) as indicators of ENE (Table 2, Fig. 4).
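As an illustration of how agreement among three observers can be summarized, the Python sketch below computes Fleiss' kappa for dichotomized ratings. The text does not state which kappa statistic was used, so Fleiss' kappa is an assumption here, and the ratings are hypothetical. The verbal labels follow the commonly used Landis and Koch scale, with which the "substantial" and "moderate" descriptions above are consistent.

```python
def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa for a list of per-image tuples, one rating per observer."""
    n_images = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: number of observers assigning image i to category j
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement per image, then averaged over images
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_images
    # Chance agreement from the overall category proportions
    p_j = [
        sum(row[j] for row in counts) / (n_images * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


def landis_koch_label(kappa):
    """Verbal interpretation on the Landis and Koch scale."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical dichotomized ratings (1 = present, 0 = absent), three observers.
example = [(1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 0, 0), (1, 1, 1), (0, 1, 0)]
kappa = fleiss_kappa(example)
print(f"kappa={kappa:.3f} ({landis_koch_label(kappa)} agreement)")
```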

Table 2 Comparison of diagnostic performance between deep learning systems and radiologists

Fig. 4

Receiver-operating characteristic (ROC) curves for the deep learning system and radiologists

Case presentations

Case 1. A 61-year-old woman had a carcinoma of the floor of the mouth with an area of 15 × 10 mm (T1N0) at the first visit. One year and two months after surgery, delayed metastasis developed in a right submandibular lymph node. A CT image of the lymph node showed a 13-mm minor axis, central necrosis, and clear borders (Fig. 5a). The deep learning classification correctly diagnosed the node as having no ENE with all five learning models. Histopathological findings showed infiltration of tumor cells into the fibrotic capsule but no ENE (Fig. 5b).

Fig. 5

Case presentations. Case 1. CT image of the lymph node shows a 13-mm minor axis, central necrosis, and clear borders (a). Histopathological findings (H&E, × 40) show infiltration of tumor cells into the fibrotic capsule but no ENE (b). Case 2. CT image of the lymph node shows a 12-mm minor axis, central necrosis, and irregular borders (c). Histopathological findings (H&E, × 40) show that the capsule was destroyed, indicating ENE (d)

Case 2. A 34-year-old woman had a tongue cancer with an area of 33 × 22 mm and a depth of invasion of 6 mm (T2N0) at the first visit. Delayed metastasis to a right upper jugular lymph node occurred 3 months after surgery. A CT image of the lymph node showed a 12-mm minor axis, central necrosis, and irregular borders (Fig. 5c). The radiologists’ judgments regarding irregular borders differed, and the finding was deemed positive after discussion. Three of the five learning models created by the deep learning system correctly diagnosed the node as having ENE, but the diagnosis by the other two models was incorrect. The histopathological findings showed that the capsule was destroyed, indicating ENE (Fig. 5d).

Discussion

Cervical lymph node metastasis with ENE in head and neck squamous cell carcinoma is a critical prognostic factor for disease-free survival and distant metastasis [5, 17, 18], and influences treatment planning [18]. When ENE is confirmed by histological examination, additional treatment is administered [19]. Therefore, it is desirable to know ENE status prior to surgery from clinical information, including imaging diagnosis [7]. However, considering the accuracy and limitations of current imaging, treatment planning cannot rely on imaging findings [4].

In this study, we applied a relatively low-cost deep learning algorithm to the CT diagnosis of ENE, confirmed its diagnostic performance, and examined the possibility of clinical application. Deep learning systems using multi-layer CNNs can automatically extract features from raw images and classify them [10,11,12,13]. With the method that automatically assigned image patches to training and testing datasets, training was repeated five times to minimize assignment bias. As a result, an AUC of 0.88 was obtained. The deep learning method showed little inter-model variability, in contrast to the interobserver variation among radiologists [1].

The conventional imaging diagnosis of ENE has been based on a minor axis threshold of ≥ 10 mm, central necrosis, and unclear boundaries, mainly using contrast-enhanced CT [1,2,3]. The reported sensitivity was insufficient at 43–83%, whereas the specificity was high, in the range of 72–98% [4, 20, 21]. This means that even when images show negative findings, some nodes may have ENE.

Another weakness of conventional imaging diagnosis is that inter-observer agreement has not been high, at 0.37–0.59 [7, 20]. In this study, the inter-observer κ value for central necrosis was 0.798, indicating substantial agreement, whereas that for irregular borders was 0.528, indicating moderate agreement.

Among the characteristic imaging findings, central necrosis was strongly correlated with histopathologically confirmed ENE [5]. Aiken et al. stated that central necrosis was the most detectable finding of ENE [4]. However, this study showed that the sensitivity based on central necrosis was not high, probably due to differences in patient distribution.

An increase in lymph node diameter is expected to indicate ENE [22]. The sensitivity based on a diameter > 10 mm was reported as 47–55% [6]. One large series reported that one-third of nodes with ENE were 10 mm or smaller [22, 23]. Zoumalan et al. found that the mean diameter of nodes with and without ENE did not differ [5]. In this study, the cutoff value of the minor axis was determined to be 11 mm in the preliminary ROC analysis, and the sensitivity based on the minor axis was confirmed to be low, similar to previous reports [6].

Inter-observer agreement for the evaluation of irregular borders was reported to be low [7]. This study confirmed that the κ value among the three radiologists was moderately low, at 0.528. Of the three characteristic findings, irregular borders had the highest sensitivity, but it was still not sufficient. If invasion into adjacent structures is evident, diagnosis is easier [21]. The presence of matted nodes, defined as three adjacent cervical lymph nodes abutting one another with loss of an intervening fat plane, may be a positive predictive factor for ENE [7]. Aiken et al. stated that histopathological ENE was still present in nearly 50% of cases without positive imaging signs [4]. Therefore, ENE cannot be excluded even if imaging findings are negative [3, 4].

PET/CT can detect aggressive tumors, but it generally has lower spatial resolution and may not improve the detection of ENE compared with CT alone [1]. MRI has superior soft-tissue resolution and may therefore perform well in delineating infiltration of adjacent fat planes and contour irregularity [7]. However, it has inferior spatial resolution and can have large motion artifacts [7]. Accuracy in detecting ENE beyond that of CT cannot be expected [24, 25]. Ultrasonography has the highest spatial resolution and can reveal structural details including nodal matting, perinodal edema, and indeterminate boundaries [7, 26]. It involves no ionizing radiation and can provide complementary information with Doppler imaging, but there has been little research on this topic [7, 26]. Further studies to adapt deep learning to PET/CT and MRI will be needed.

This study has some limitations. Data from other facilities were not used. External validation and prospective testing with data acquired using different scan-specific parameters, including tube voltage and IV contrast protocol, should be performed to create a generalizable model [1]. Another weakness was the small sample size. Further study will be needed to reach the ultimate goal of this project, which is to develop a usable clinical assistance tool.

In conclusion, the diagnostic performance of the deep learning system for ENE was higher than that of radiologists. With further study involving a larger number of patients, this method is expected to provide diagnostic support.