Introduction

According to the latest global cancer statistics, the incidence of thyroid cancer is rising worldwide [1]. As the most common thyroid malignancy, papillary thyroid carcinoma (PTC) accounts for approximately 85–95% of all thyroid malignancies and is an indolent disease [2, 3]. However, approximately 30–80% of PTCs are associated with lymph node (LN) metastases [4,5,6], and some studies have shown that LN metastases are associated with disease relapse [7]. It is generally considered that central compartment lymph node dissection (CLND) is required for patients with clinically involved LNs in the central or lateral compartment [8]. Therefore, the accurate identification of central compartment lymph node metastasis (CLNM) is crucial for the optimal management of patients with PTC.

As CLNM is difficult to detect preoperatively, the role of routine CLND has long been controversial [9]. Some medical institutions hold that CLND may reduce PTC recurrence, upgrade the tumour node metastasis stage and inform risk stratification for recurrence. Accordingly, treatment regimens, such as radioactive iodine-131 therapy, may be altered. In general, CLND can allow PTC patients to receive more active treatment and reduce the need for potentially hazardous reoperative surgery. In contrast, the 2015 guidelines of the American Thyroid Association (ATA) do not recommend prophylactic CLND, stating that it may be considered in high-risk patients with advanced primary tumours [9]; the ATA takes the view that there is still insufficient evidence to show that prophylactic CLND is beneficial in reducing recurrence rates [10, 11]. Moreover, CLND increases the potential surgical risk for patients [12, 13]. The 2014 Japanese Society of Thyroid Surgeons and Japanese Association of Endocrine Surgeons (JSTS/JAES) guidelines note that in the absence of definitive data about prophylactic CLND in a large series of patients, its indication depends on the institutional policy and surgeons’ skill levels [14].

As the preferred inspection method for thyroid cancer, ultrasound (US) has limited power for evaluating CLNM [15]; indeed, the sensitivity of US in diagnosing CLNM ranges from 20 to 60% [16,17,18,19,20]. At present, there is no uniform standard for weighing the pros and cons of prophylactic CLND. Thus, there is an urgent need for quantitative means to predict CLNM preoperatively, and attention should be focused on risk stratification.

Although several studies have reported high-risk factors relative to clinical and ultrasound features predictive of CLNM in PTC [4,5,6], the results have been inconsistent. In addition, some of the risk factors identified, such as extrathyroidal invasion and tumour differentiation, are only available postoperatively [21, 22] and cannot provide valuable information for aiding in pretreatment decision-making. Therefore, the development of an appropriate and noninvasive approach for assessing CLNM has been challenging.

Our study aimed not only to identify risk factors that may predict CLNM but also to develop and validate a nomogram combining clinical and ultrasound features as an accurate, easy-to-use model for objectively quantifying the likelihood of CLNM preoperatively.

Materials and methods

Patients

This retrospective study was approved by the Ethics Committee of the Chinese PLA General Hospital (Beijing, China), and the requirement for informed consent was waived. We retrospectively evaluated patients with histologically confirmed PTC treated in our hospital between July 2018 and March 2019. They were enrolled according to the following inclusion criteria: (1) at least one suspected malignant thyroid nodule was identified and confirmed to be malignant by US-guided puncture biopsy; (2) initial thyroid surgery with CLND was performed and PTC was confirmed histologically; (3) no other treatment was performed before the operation; and (4) the thyroid ultrasound examination was performed in our department within 1 month before the operation. Patients were excluded based on the following: (1) they had distant metastases or malignant tumours in other organs; (2) they received another treatment before surgery; (3) the ultrasound imaging information was incomplete, or the quality of the images was poor; or (4) they had skip metastases [23]. Figure 1 shows the patient recruitment process. Ultimately, 952 patients (267 men and 685 women, mean age 43.2 ± 10.9 [range, 15–77] years) were included and divided into training and validation datasets according to the time of surgery. Each dataset was divided into CLNM-negative and CLNM-positive groups according to the pathology results. Baseline clinical data, including age and sex, were collected from medical records, and the threshold value of age was determined by optimal scaling regression analysis.

Fig. 1 Flow chart of the patients enrolled in our study

Ultrasound image acquisition and assessment of ultrasound imaging features

US examinations were performed using an S2000 (Siemens Healthineers) system equipped with a 5–14-MHz linear probe and an IU22 (Philips) system equipped with a 5–12-MHz linear probe. The US imaging characteristics of each patient were retrospectively reviewed by two independent radiologists with more than 10 years of experience in thyroid imaging; neither observer was aware of the clinical or pathological outcomes. If the radiologists disagreed, they met to reach a final decision by consensus. The imaging characteristics of each nodule included tumour size, multifocality, aspect ratio (height divided by width on transverse views, A/T), tumour site, distance between the nodules and the adjacent capsule, microcalcification distribution, tumour internal vascularity and Hashimoto’s thyroiditis. Multiple images of the longitudinal and transverse axes were fully evaluated. Tumour size was defined as the maximum diameter (D) of the nodule and was classified, according to optimal scaling regression analysis, as follows: D ≤ 0.5 cm, 0.5 < D ≤ 1.0 cm, 1.0 < D ≤ 1.5 cm and D > 1.5 cm. Multifocality was defined as a suspicion of malignancy in more than one nodule. In multifocal cases, tumour size was classified according to the diameter of the largest tumour. The A/T was classified as ≤ 1 or > 1. The location of the tumour was evaluated from three aspects: location 1, location 2 and location 3. These locations were divided into the following categories: upper, mid, lower, left lobe, right lobe, isthmic, inner side, outer side and middle. The microcalcification pattern was categorised as absent, present, multiple or diffusely distributed within the nodule on the US image (Fig. 2a–d). A microcalcification was defined as a punctate high echo measuring ≤ 2 mm, and multiple microcalcifications were defined as more than five punctate high echoes within a single nodule.
The relationship between the tumour and adjacent capsule was classified into three categories as follows: protruding outside the thyroid capsule, ≤ 2 mm (including contact with the capsule), and > 2 mm (Fig. 2e–h). Tumour vascularity was classified in accordance with the Adler criterion [24] from 0 to 3 and evaluated by colour Doppler flow imaging (CDFI). Hashimoto’s thyroiditis was diagnosed on the basis of US images.

Fig. 2 Classification of the ultrasound imaging features. The microcalcification pattern was categorised as absent (a), present (b), with multiple microcalcifications (c) and diffuse distribution within the nodules (d). The distance between the tumour and the capsule was defined as the shortest distance from the tumour border to the thyroid capsule on transverse and longitudinal views. It was classified into three categories as follows: protruding outside the thyroid capsule (e), ≤ 2 mm (f, g) and > 2 mm (h)

Considering that the diagnostic performance of our model depends on the accuracy of operator-reported imaging features, interobserver reproducibility for ultrasound features was assessed.

US-reported LN status

In the preoperative evaluation of cervical LNs, an LN was considered suspicious if it had one of the following features: loss of the fatty hilum, microcalcifications, hyperechoic change, a round shape or necrosis [25, 26].

Feature selection and ultrasound signature construction

We performed least absolute shrinkage and selection operator (LASSO) regression [27] to select the strongest predictive features among all the US imaging characteristics in the training cohort. LASSO regression shrinks the coefficients of uninformative features to zero, with the amount of shrinkage controlled by the regularisation parameter λ. The features with nonzero coefficients were then used to build a logistic regression model, and the combination of these features is called the US signature in our study. Its predictive performance was assessed by ROC analysis.
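The shrinkage behaviour described above can be illustrated with a minimal sketch. It is not the R workflow used in the study: it assumes a plain proximal gradient solver for L1-penalised logistic regression, and the function names, step size and toy data are hypothetical. The toy data contain one feature that tracks the outcome and one that is pure noise.

```python
import math

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0.0 if it would cross zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lasso_logistic(X, y, lam, step=0.1, n_iter=2000):
    """L1-penalised logistic regression fitted by proximal gradient descent.

    X: list of feature rows, y: list of 0/1 labels, lam: penalty strength (lambda).
    The intercept is not penalised. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals: predicted probability minus observed label.
        resid = [sigmoid(intercept + sum(c * x for c, x in zip(coefs, row))) - label
                 for row, label in zip(X, y)]
        # Gradient step on the mean log-loss, then soft-threshold each coefficient.
        intercept -= step * sum(resid) / n
        grads = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        coefs = [soft_threshold(c - step * g, step * lam) for c, g in zip(coefs, grads)]
    return intercept, coefs

# Toy data (hypothetical): feature 0 tracks the outcome, feature 1 is pure noise.
X, y = [], []
for f0 in (0, 1):
    for f1 in (0, 1):
        positives = 3 if f0 == 1 else 1  # events out of 4 patients per pattern
        for k in range(4):
            X.append([f0, f1])
            y.append(1 if k < positives else 0)

_, weak_penalty = lasso_logistic(X, y, lam=0.05)
_, strong_penalty = lasso_logistic(X, y, lam=1.0)
print(weak_penalty, strong_penalty)
```

With the weaker penalty the informative feature keeps a nonzero coefficient while the noise feature is removed; with the stronger penalty both are removed. In practice λ is usually chosen by cross-validation rather than fixed by hand.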

Model construction and nomogram establishment in the training cohort

We performed multivariate logistic regression analysis by combining the US signature with clinical characteristics, including age, sex and US-reported LN status. A predictive model named the combined model was thus constructed. This model is presented as a nomogram that can visually and individually indicate the probability of CLNM.
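As a rough illustration of how a logistic regression model is read off as a nomogram, the sketch below converts a linear predictor into a probability and rescales each predictor to a 0–100 point axis. The coefficients, predictor names and ranges are hypothetical placeholders, not the fitted values of the combined model.

```python
import math

def predicted_probability(intercept, coefficients, patient):
    """Logistic-regression probability for one patient.

    coefficients and patient are dicts keyed by predictor name.
    """
    linear_predictor = intercept + sum(
        coefficients[name] * patient[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))

def nomogram_points(coefficients, ranges, max_points=100.0):
    """Points spanned by each predictor, scaled so that the predictor with the
    largest effect across its range spans max_points."""
    spans = {name: abs(coefficients[name]) * (high - low)
             for name, (low, high) in ranges.items()}
    largest = max(spans.values())
    return {name: max_points * span / largest for name, span in spans.items()}

# Hypothetical coefficients for illustration only; NOT the values in Table 4.
intercept = -1.0
coefficients = {"young_age": 0.8, "us_ln_positive": 1.5, "us_signature": 2.0}
ranges = {"young_age": (0, 1), "us_ln_positive": (0, 1), "us_signature": (0, 1)}

low_risk = predicted_probability(
    intercept, coefficients,
    {"young_age": 0, "us_ln_positive": 0, "us_signature": 0.0})
high_risk = predicted_probability(
    intercept, coefficients,
    {"young_age": 1, "us_ln_positive": 1, "us_signature": 1.0})
points = nomogram_points(coefficients, ranges)
print(round(low_risk, 3), round(high_risk, 3), points)
```

The point axes are only a graphical rescaling of the regression coefficients, so the probability read from the nomogram should match the one computed directly from the model.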

Evaluation of the predictive model

The prediction formula derived from the training cohort was applied to all PTC patients in the validation cohort, and the probability of CLNM was calculated. For calibration of the model, calibration curves were plotted using pathological results and the nomogram prediction probabilities of CLNM. We evaluated the goodness of fit of the model using the Hosmer-Lemeshow test, for which a nonsignificant result indicates no evidence of departure from good calibration [28]. Nomogram discrimination was quantified using a ROC curve. Decision curve analysis was conducted to estimate the clinical utility of the nomogram.
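The Hosmer-Lemeshow test mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: patients are split into equal-sized risk groups, the conventional groups − 2 degrees of freedom are used, and the p-value relies on a closed form that is valid only when the degrees of freedom are even. The function name and toy data are hypothetical.

```python
import math

def hosmer_lemeshow(probabilities, outcomes, groups=10):
    """Hosmer-Lemeshow statistic and p-value.

    Assumes an even number of groups so that df = groups - 2 is even, and that
    no group has an expected event count of exactly 0 or its full size.
    """
    pairs = sorted(zip(probabilities, outcomes))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)
        observed = sum(o for _, o in chunk)
        # Contributions of events and non-events in this risk group.
        statistic += (observed - expected) ** 2 / expected
        statistic += (observed - expected) ** 2 / (size - expected)
    # Chi-square survival function in closed form for even degrees of freedom.
    half_df = (groups - 2) // 2
    half_x = statistic / 2.0
    p_value = math.exp(-half_x) * sum(
        half_x ** i / math.factorial(i) for i in range(half_df))
    return statistic, p_value

# Toy data (hypothetical): 10 risk groups of 20 patients whose observed event
# counts equal the expected counts, i.e. a well-calibrated model.
probabilities, outcomes = [], []
for k in range(10):
    p = 0.05 + 0.1 * k
    events = 1 + 2 * k  # equals 20 * p
    for i in range(20):
        probabilities.append(p)
        outcomes.append(1 if i < events else 0)

calibrated = hosmer_lemeshow(probabilities, outcomes)
miscalibrated = hosmer_lemeshow(probabilities, [1 - o for o in outcomes])
print(calibrated, miscalibrated)
```

A large p-value, as in the calibrated toy case, means the test found no evidence of miscalibration; it does not prove that calibration is perfect.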

Statistical analysis

Statistical analysis was conducted with SPSS Statistics version 24.0 (IBM Corp.) and R software version 3.5.3 (The R Foundation for Statistical Computing). Categorical variables were reported as numbers and percentages. A chi-square test or Fisher’s exact test was used to assess differences between groups. The Mann–Whitney U test was applied for continuous variables. The reported statistical significance levels were all two-sided, with statistical significance set at 0.05.

Interobserver agreement was analysed for each variable using kappa (κ) statistics. The agreement between tumour size on US imaging and the pathological results was evaluated by Spearman’s correlation analysis. ROC curve analysis was employed to determine the cut-off value for the probability of CLNM that maximised the Youden index, and the sensitivity, specificity, accuracy, positive predictive value (PPV) and negative predictive value (NPV) were calculated at that cut-off.
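The cut-off selection and the derived diagnostic metrics can be sketched as follows. The helper names and the eight-patient toy data are hypothetical; every observed probability is tried as a candidate cut-off, and a patient is called positive when the predicted probability is at or above the cut-off.

```python
def confusion_counts(probabilities, outcomes, cutoff):
    """Return (tp, fp, tn, fn) when probability >= cutoff is called positive."""
    tp = fp = tn = fn = 0
    for p, o in zip(probabilities, outcomes):
        if p >= cutoff:
            if o == 1:
                tp += 1
            else:
                fp += 1
        else:
            if o == 1:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def diagnostic_metrics(tp, fp, tn, fn):
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def youden_cutoff(probabilities, outcomes):
    """Return the cut-off maximising sensitivity + specificity - 1 and that maximum."""
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(probabilities)):
        tp, fp, tn, fn = confusion_counts(probabilities, outcomes, cutoff)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j

# Toy data (hypothetical): predicted probabilities and pathological CLNM status.
probabilities = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 1, 1, 1, 1]

cutoff, youden = youden_cutoff(probabilities, outcomes)
metrics = diagnostic_metrics(*confusion_counts(probabilities, outcomes, cutoff))
print(cutoff, youden, metrics)
```

In this toy example the selected cut-off is 0.6, which gives a sensitivity of 0.8 and a specificity of 1.0.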

Results

Characteristics of patients and features of nodules

Among the 952 patients, the prevalence of CLNM differed significantly by sex: CLNM was found in 64.04% of male and 44.09% of female patients (p < 0.001). Young age was highly predictive of CLNM; for male patients, 40 years was confirmed as the threshold value for age, whereas the age distribution was particularly uneven for female patients with CLNM. To adjust for the age factor, we developed two separate nomograms for female patients: one for young females (≤ 35 years) and one for elder females (> 35 years), in whom the CLNM rates were 58.25% and 38.49%, respectively. The accuracy of the subjective US-reported LN status was only 65.6% for the entire cohort, with a high specificity of 93.1% but a poor sensitivity of 39.9%. In total, 284 patients were reported to be LN negative on US but were confirmed to have CLNM postoperatively.

The patient characteristics and US features of thyroid nodules in the training and validation cohorts are shown in Tables 1, 2 and 3. Almost none of the characteristics differed significantly between the two datasets, which supported their use as training and validation cohorts. Univariate analysis was conducted to determine differences in clinical and US characteristics between CLNM-positive and CLNM-negative groups. The agreement of ultrasound features between the two radiologists was satisfactory, with kappa coefficients between 0.81 and 0.92 (Supplementary Table 1).
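For readers unfamiliar with the kappa coefficient, the sketch below computes Cohen's kappa for two raters on a categorical feature. The ratings are invented toy data, not the study's readings; kappa corrects the raw agreement rate for the agreement expected by chance.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same cases on a categorical scale."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Toy ratings (hypothetical): microcalcification called present or absent in 10 nodules.
reader_1 = ["present"] * 5 + ["absent"] * 5
reader_2 = ["present"] * 4 + ["absent"] * 6  # the readers disagree on one nodule

kappa = cohen_kappa(reader_1, reader_2)
print(round(kappa, 2))
```

Here the readers agree on 9 of 10 nodules and chance agreement is 0.5, so kappa is (0.9 − 0.5) / (1 − 0.5) = 0.8.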

Table 1 Clinical and US imaging characteristics of male patients in the training and validation datasets
Table 2 Clinical and US imaging characteristics of young female patients (≤ 35 years) in the training and validation datasets
Table 3 Clinical and US imaging characteristics of elder female patients (> 35 years) in the training and validation datasets

Ultrasound signature construction and diagnostic validation

LASSO regression analysis identified tumour size and microcalcification as the strongest US imaging predictors in the male and young female training cohorts; tumour vascularity was additionally selected for the elder female patients (Fig. 3a–f). A US signature containing these CLNM-related features was constructed based on the US score. The prediction performance of the US signature was good and was then confirmed in the validation cohort (Fig. 5a–c).

Fig. 3 Ultrasound feature selection using the LASSO binary logistic regression model. LASSO coefficient profiles of the US features associated with CLNM (a, b, c). Dotted vertical lines were drawn at the optimal values by using the minimum criteria and the 1 standard error of the minimum criteria (the 1-SE criteria) (d, e, f)

Development of the prediction model

Significant differences between CLNM-positive and CLNM-negative patients were observed for the US signature and clinical characteristics. After multivariate analysis, age, the US-reported LN status and the US signature remained independent predictors for CLNM, as shown in Table 4.

Table 4 Multivariate logistic regression analysis of risk factors for CLNM

Validation of the individualised prediction nomogram

The nomogram displayed good performance for predicting CLNM in the training cohort (Fig. 4a–f). When applied to the validation dataset, the nomogram still displayed good discrimination in the male and young female cohorts, with AUCs of 0.813 (95% CI, 0.722–0.904) and 0.814 (95% CI, 0.712–0.915), respectively (Fig. 5a, b). The sensitivity and accuracy for the prediction of CLNM were much better than those for US detection. Comparisons of diagnostic performance are shown in Table 5.

Fig. 4 ROC curve of the US-reported LN status, US signature and combined model for predicting CLNM in the training dataset of male patients (a), young female patients (b) and elder female patients (c). Calibration curve of the combined model in the training cohorts of male patients (d), young female patients (e) and elder female patients (f). The x-axis represents the probability that the nomogram predicted CLNM, and the y-axis represents the actual rate of CLNM. The diagonal dashed line indicates ideal prediction by a perfect model, and the solid line represents the predictive power of the nomogram. The closer the solid line is to the dashed line, the better the predictive power of the model

Fig. 5 ROC curve of the US-reported LN status, US signature and combined model for predicting CLNM in the validation dataset of male patients (a), young female patients (b) and elder female patients (c). Calibration curve of the combined model in the validation cohorts of male patients (d), young female patients (e) and elder female patients (f)

Table 5 Diagnostic performance of the nomogram for predicting CLNM compared with US evaluation for LN status

The calibration curve for the nomogram yielded a nonsignificant Hosmer-Lemeshow statistic in the validation dataset (p = 0.928 for males, p = 0.08 for young females), suggesting no significant departure from a perfect fit (Fig. 5d, e). The combined model was presented as a nomogram (Fig. 6a, b).

Fig. 6 Nomogram for predicting CLNM in male (a) and female patients (b). The ultrasound nomogram was developed in the training cohort, with age, US-reported LN status and US signature incorporated. The different values for each variable correspond to a point at the top of the graph; points for all variables are added and translated into the probability of CLNM. Decision curve analysis of the ultrasound nomogram for male (c) and young female patients (d). The x-axis represents the threshold probability, and the y-axis represents the net benefit

However, for the elder female cohort, the performance of the prediction model was not sufficient, with an AUC of 0.742 (95% CI, 0.661–0.823) (Fig. 5c). The calibration curve showed a statistically significant departure from a perfect fit in the validation dataset (p = 0.03) (Fig. 5f).

Clinical use

Decision curve analysis of the nomogram is presented in Fig. 6c and d. The decision curve showed that if the threshold probability was > 0.21 for males or > 0.14 for females, using the nomogram to predict LN metastases added more benefit than either the treat-all-patients scheme or the treat-none scheme.
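The net benefit plotted on a decision curve can be computed as in the sketch below, which uses the standard definition (true positives minus false positives weighted by the odds of the threshold probability, per patient). The toy probabilities and outcomes are hypothetical; the treat-none strategy has a net benefit of zero by definition.

```python
def net_benefit(probabilities, outcomes, threshold):
    """Net benefit of treating patients whose predicted probability >= threshold."""
    n = len(outcomes)
    tp = sum(1 for p, o in zip(probabilities, outcomes) if p >= threshold and o == 1)
    fp = sum(1 for p, o in zip(probabilities, outcomes) if p >= threshold and o == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)

def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of treating every patient regardless of the model."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# Toy data (hypothetical): predicted probabilities and pathological CLNM status.
probabilities = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 1, 1, 1, 1]

model_nb = net_benefit(probabilities, outcomes, threshold=0.5)
treat_all_nb = net_benefit_treat_all(outcomes, threshold=0.5)
treat_none_nb = 0.0
print(model_nb, treat_all_nb, treat_none_nb)
```

A model adds clinical value at a given threshold probability when its net benefit exceeds both the treat-all and treat-none values, as it does in this toy example.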

Discussion

In this study, we developed and validated an ultrasound-based model stratified by sex and age for predicting the probability of CLNM in PTC patients. The nomogram successfully stratified patients according to their risk of CLNM and performed well, especially in the male and younger female cohorts. In contrast to published studies, the originality of our study lies in the fact that the individual probability of CLNM can be evaluated preoperatively and noninvasively [22].

Young age has been recognised as an important risk factor for predicting LN metastasis and the recurrence of PTC [29, 30]. Xu et al [31] reported that younger age (≤ 36 years) was an independent clinical factor for predicting CLNM. Ito et al [32] also found that patients younger than 40 years showed more tumour growth than did older patients during active surveillance, with an age threshold similar to our research findings. An increasing number of recent studies have shown an association between male sex and aggressive PTC tumour behaviours [33]. We considered that the pathogenesis and biological behaviour of tumours in PTC patients may differ according to sex and age. Therefore, when the model was stratified by age and sex, its performance was much stronger than that of the previous unstratified model. We also noted that the validation results for females older than 35 years were unsatisfactory. Based on this result, we consider that for most of the elderly patients, the tumours were discovered accidentally instead of being actively surveilled, and thus the exact history of the tumours was unknown. As patients with a longer history of PTC who should have been in the younger group were included in the elder group, the stability of the prediction model was disrupted. We will explore deeper reasons in our future research.

In our study, a US signature was built using the strongest risk factors including tumour size and microcalcification for predicting CLNM. Tumours with a larger size on US examination were more likely to be associated with CLNM in our research, consistent with other reports [34, 35]. As a preoperative tool, the nomogram relies on a close correlation between sonographic findings and corresponding pathological results. Several studies have examined US findings of thyroid malignancies compared with pathologic results and found good agreement with regard to tumour size [36]. A finding that merits discussion is the potential impact of microcalcification on CLNM, which was the strongest predictor in the US signature. A large number of punctate high echoes within thyroid nodules is of great predictive value for CLNM, especially when the microcalcification is diffusely distributed. In our study, nodules in 49 patients were characterised by diffuse microcalcification, and postoperative pathology identified CLNM in 91.8% (45/49) of these patients. Notably, among these patients, CLNM was identified in 100% (12/12) of male patients and 96.1% (25/26) of young female patients. To the best of our knowledge, only a few studies have investigated the association between calcification and LN metastases, and the correlation between the distribution of microcalcification and CLNM has not been reported. Bai et al [37, 38] evaluated a group of PTC patients to determine the clinical significance of different types of calcification, finding that patients with psammoma bodies were more likely to have gross LN metastases, which is consistent with our findings, as psammoma bodies mostly represent microcalcification on US images [39].
Nonetheless, it should be noted that other pathological structures, such as focal fibrosis of nodular goitres, have an appearance similar to microcalcification on US images; therefore, attention should be paid to the identification of microcalcification.

In our research, tumour vascularity on CDFI differed significantly between the CLNM-positive and CLNM-negative groups; the richer the blood supply, the higher the probability of CLNM [40, 41]. However, the US evaluation of internal vascularity is unreliable and easily influenced by the operator and the machine. This may be why CDFI was not selected in building the male and young female models. In addition, multifocality was not included; it showed only limited statistical significance, largely because microscopic lesions smaller than 1 mm can be seen only by microscopic examination, and the identification of tumour multifocality is highly dependent on the radiologist.

Because of the heterogeneity in US image acquisition and clinical data collection across institutions, we applied decision curve analysis [42] rather than multi-institutional prospective validation to assess the clinical usefulness of the nomogram.

Incorporating the US signature and clinical risk factors into an easy-to-use nomogram facilitates preoperative individualised prediction of CLNM. This nomogram may help to answer questions such as whether CLNM exists, and this may affect the surgical strategy. We suggest that patients with a high score are potential candidates for CLND. The clinical use of the nomogram may guide clinicians in stratifying patients and thereby avoid unnecessary surgery.

Despite the good results, our study has some limitations. First, the performance of our nomogram depends on the accuracy of operator-reported imaging features. The criteria used to evaluate the US signature were subjective. Nevertheless, the interobserver agreement for each feature in our study was good. Second, the pathological subtype of PTC was not taken into account. Third, owing to the retrospective design, there was a potential for selection bias. Lastly, stringent external validation needs to be performed in larger, prospective multicentre clinical trials to obtain a more objective conclusion. The model may also be improved by the addition of further technologies such as elastography and computer-aided diagnosis systems [43, 44], and we intend to investigate this in the future.

Conclusion

In conclusion, this study presents a nomogram based on clinical characteristics and US imaging features, and this easy-to-use scoring system can be conveniently applied to facilitate preoperative individualised prediction of CLNM in PTC patients, which is in line with the current trend towards personalised treatment.