Introduction

Thyroid nodules are common gland abnormalities that can be found in up to 65% of the general population and are mostly benign and clinically silent. This high prevalence reflects a diagnostic detection bias associated with the growing use of ultrasound (US) for the evaluation and follow-up of neck structures, including the thyroid. In the presence of an (incidental) thyroid nodule, a main objective of its management is to identify the subgroup of clinically significant thyroid tumors, which account for 5–10% of cases [1].

Because most thyroid nodules are asymptomatic or incidentally diagnosed by thyroid ultrasound, several scientific societies have proposed diagnostic algorithms for an initial stratification of the malignancy risk [2,3,4,5,6], but none of them has been sufficiently standardized [7,8,9]. As a consequence, US-guided fine-needle aspiration (FNA) biopsy of thyroid nodules is currently an overused practice, resulting in nonselective, extensive sampling of benign lesions that might not require this procedure or further clinical attention [10, 11].

In this retrospective study on patients with thyroid nodules who underwent surgical resection, we aimed to evaluate the accuracy of three different international US stratification systems in predicting malignancy, namely the 2015 American Thyroid Association (ATA) guidelines, the American Association of Clinical Endocrinologists, American College of Endocrinology, and Associazione Medici Endocrinologi Medical Guidelines (AACE/ACE/AME) guidelines, and the European Thyroid Association guidelines (EUTIRADS) [2, 5, 6], and to highlight the main critical issues of thyroid nodule diagnostic algorithms.

Materials and methods

This retrospective observational study included 146 consecutive patients who were referred to our Center for FNA cytology for suspected thyroid nodules and then underwent thyroid surgery.

All data were retrieved from the Città della Salute e della Scienza University Hospital in Turin from January 2015 to September 2018. The Institutional Review Boards of our hospital approved this study. Since the purpose of this study was to evaluate the reliability of US scores in predicting malignancy of thyroid nodules by comparing US score outcomes with both cytological and histological diagnoses, cases undergoing surgery with a non-diagnostic cytology at FNA were excluded. Moreover, the number of nodules with this cytological report (7 cases) would have been too low for an accurate analysis.

To minimize the detection bias, histological and cytological materials, as well as the clinical charts and ultrasonographic records, were anonymized by a staff person not involved in this project, and only coded data were used for microscopic review and statistical analyses.

Ultrasound features

Both transverse and longitudinal sonograms were obtained by real-time imaging of the thyroid nodules using an Esaote MyLab Twice real-time US system with a linear multifrequency (7–14 MHz) probe. The still sonographic images were independently reviewed by two board-certified radiologists (R.G. and S.G.) and two endocrinologists (R.R. and L.P.) with >10 years of experience. In case of discordance, a mutual agreement was reached after discussion and review of the video clips filmed during the FNA procedure.

The sonographic findings were analyzed by assigning the examined features to each category of the Ultrasound Risk Stratification Systems (US-RSS) based on the ATA [2], EUTIRADS [6], and 2016 AACE/ACE/AME [5] guidelines. Diameters (anteroposterior, transverse, and longitudinal) of each thyroid nodule were measured in millimeters (mm). Regarding echostructure/composition, the nodules were classified as solid or predominantly solid (cystic component ≤ 10% and ≤ 50%, respectively), cystic or predominantly cystic (cystic component > 50%), and spongiform (nodules containing multiple small cysts smaller than 5 mm interspersed within the solid tissue component) [1, 3, 12].

The echogenicity of the nodules was classified as markedly hypoechoic, hypoechoic, isoechoic, anechoic or hyperechoic. A markedly hypoechoic lesion was defined as a thyroid nodule that showed a relatively hypoechoic pattern compared to the adjacent strap muscles of the neck. Nodular margins were categorized as ill-defined or smooth. Calcifications were subdivided into microcalcifications (defined as calcifications that were equal to or less than 1 mm and visualized as tiny punctate hyperechoic foci, either with or without acoustic shadows) and macrocalcifications (defined as hyperechoic foci larger than 1 mm). Nodule shape, also referred to as nodule orientation along the longitudinal axis, was divided into parallel or not parallel (taller than wide). In this study, the taller-than-wide shape, observed in only two cases, was assessed by means of measurements and was defined as an anteroposterior (AP) diameter exceeding the transverse (T) diameter [13].
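The size-based definitions above can be written as two small rules. The following is a minimal sketch only; the function names are hypothetical and are not part of the study protocol.

```python
def classify_calcification(size_mm: float) -> str:
    """Microcalcification if <= 1 mm, macrocalcification if larger (as defined in the text)."""
    return "microcalcification" if size_mm <= 1.0 else "macrocalcification"


def classify_shape(ap_mm: float, transverse_mm: float) -> str:
    """Taller than wide when the anteroposterior diameter exceeds the transverse one."""
    if ap_mm > transverse_mm:
        return "not parallel (taller than wide)"
    return "parallel"


# Hypothetical measurements, for illustration only.
print(classify_calcification(0.8))
print(classify_shape(ap_mm=14.0, transverse_mm=11.0))
```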

According to the US-RSS of the AACE/ACE/AME 2016 system only, we evaluated vascularization by color and power Doppler examination and stiffness by qualitative elastography. In our study, we performed strain elastography, with pressure exerted freehand through the ultrasound transducer. An elastographic image (elastogram) was then produced as a color-coded image superimposed on the B-mode image; the two images were displayed side by side on the screen.

The elastogram provides a mapping of the stiffness of the nodule considered in each position and allows for a qualitative assessment [14, 15].

By evaluating the color pattern prevalent within the nodule, it is possible to qualitatively compare the result with a progressive reference score chosen among those proposed in the literature. In our study, we used the classification proposed by Rago et al. [16]. We defined the nodules with patterns 1 and 2 proposed by Rago as soft; the nodules corresponding to pattern 3 were classified as having intermediate elasticity, and the nodules with patterns 4 and 5 were considered hard. If an intranodular vascularization pattern or elevated stiffness was documented, the nodule was classified as at intermediate risk.
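The grouping of the Rago patterns into three stiffness classes can be sketched as a simple lookup. The function name is hypothetical; only the mapping of patterns 1 to 5 is taken from the text.

```python
def stiffness_from_rago(pattern: int) -> str:
    """Map a Rago elastography pattern (1-5) to the stiffness class used in this study."""
    if pattern in (1, 2):
        return "soft"
    if pattern == 3:
        return "intermediate"
    if pattern in (4, 5):
        return "hard"
    raise ValueError(f"Rago pattern must be between 1 and 5, got {pattern}")


for p in range(1, 6):
    print(p, stiffness_from_rago(p))
```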

All patients underwent total thyroidectomy or hemithyroidectomy, with an ultimate histological definition of the nodule nature.

According to recently published data [17], in our study, we interpreted the ultrasound data of each US-RSS into the following categories:

  • Low risk nodule: low risk according to the 2016 AACE/ACE/AME guidelines; benign, very low suspicion and low suspicion according to the 2015 ATA guidelines; and classes 2–3 according to 2017 EUTIRADS.

  • Intermediate risk nodule: intermediate risk according to the 2016 AACE/ACE/AME guidelines; intermediate suspicion according to the 2015 ATA guidelines; and class 4 according to 2017 EUTIRADS.

  • High risk nodule: high risk according to the 2016 AACE/ACE/AME guidelines; high suspicion according to the 2015 ATA guidelines; and class 5 according to 2017 EUTIRADS.
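The harmonization of the three systems into common low, intermediate and high risk classes can be expressed as a lookup table. This is an illustrative sketch: the dictionary and function names are hypothetical, while the category assignments follow the list above.

```python
# Native category of each guideline -> common risk class used in this study.
COMMON_RISK = {
    "AACE/ACE/AME": {
        "low risk": "low",
        "intermediate risk": "intermediate",
        "high risk": "high",
    },
    "ATA": {
        "benign": "low",
        "very low suspicion": "low",
        "low suspicion": "low",
        "intermediate suspicion": "intermediate",
        "high suspicion": "high",
    },
    "EUTIRADS": {2: "low", 3: "low", 4: "intermediate", 5: "high"},
}


def harmonize(system, category):
    """Return the common risk class for a guideline-specific category."""
    return COMMON_RISK[system][category]


print(harmonize("ATA", "very low suspicion"))
print(harmonize("EUTIRADS", 4))
```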

US-guided FNA

After the ultrasound characterization, the FNA procedure followed. Samples were obtained with 21/23-gauge needles, and a range of 1–4 passes was performed depending on the location of the lesion; most passes were performed together with a rapid on-site evaluation (ROSE). The aspirated material was smeared onto two slides (one fixed in 95% ethanol and stained with a rapid hematoxylin-eosin (HE) stain for ROSE and the other air dried for Giemsa stain), and the excess material was placed in a 95% alcoholic solution for cell block preparation. For each cell block, two sections were prepared for routine (HE) and Papanicolaou (PAP) staining. Each case was then classified according to the criteria published in the Italian Consensus for Thyroid Cytopathology (SIAPEC-IAP) as follows: Tir1—nondiagnostic; Tir1c—nondiagnostic/cystic; Tir2—benign; Tir3A—indeterminate (low-risk lesion); Tir3B—indeterminate (high-risk lesion); Tir4—suspicious of malignancy; and Tir5—malignant [18].

Specimen interpretation

All cytological and histological specimens were reviewed by three pathologists trained in thyroid histo- and cyto-pathology (F.M., D.P., and M.P.). Each reviewer was blinded to the other judgments. Discordant cases were jointly discussed under a multi-head microscope, and a consensus was reached.

Statistical analyses

All nodules were retrospectively classified using the US criteria of three international guidelines [2, 5, 6]. The risk of malignancy (ROM) associated with each category was calculated as a percentage. For each guideline, thyroid nodules were dichotomized into two groups based on its recommendations for biopsy (including the dimensional criterion): US-guided FNA indicated or not indicated. The diagnostic performance of each set of US criteria in identifying thyroid cancer was evaluated by calculating the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy. The agreement between each pair of US-RSSs was then evaluated by the Cohen kappa test. The kappa coefficient was interpreted as follows: 0.00–0.20, poor agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, good agreement; and 0.81–1.00, excellent agreement.
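For illustration, Cohen's kappa for two scoring systems and the interpretation bands listed above can be computed as follows. This is a minimal sketch in plain Python rather than the Stata procedure used in the study; the function names and the six example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same nodules."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same, non-empty set of nodules")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Agreement expected by chance from the marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa):
    """Agreement bands as listed in the text."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "excellent"


# Hypothetical ratings of six nodules by two systems (not study data).
system_1 = ["low", "low", "intermediate", "high", "high", "intermediate"]
system_2 = ["low", "intermediate", "intermediate", "high", "high", "low"]
k = cohen_kappa(system_1, system_2)
print(round(k, 2), interpret_kappa(k))
```

Values below 0.00, which the listed bands do not cover, are reported as poor agreement in this sketch.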

Continuous variables are presented as mean ± standard deviation and were compared using Student's t-test. Categorical variables were analyzed using the chi-square test. Results were considered statistically significant if the P value was less than 0.05. All statistical analyses were performed with Stata/IC 10 (StataCorp LP, Texas, USA).

Results

Demographic and sonographic characteristics of thyroid nodules

Of the 146 patients included in our study, 111 were female (76.0%); the mean age was 50.5 ± 14.8 years. The mean nodule size was 26.2 ± 16.4 mm (one suspicious nodule was evaluated per patient).

According to the histological diagnosis, 78 lesions were benign and 68 malignant; noninvasive follicular thyroid neoplasm with papillary-like nuclear features (NIFTP) and well-differentiated tumor of uncertain malignant potential (WDT-UMP) were also considered malignant lesions.

The patients with histologically confirmed malignant nodules were younger than those with benign nodules (47.8 vs 52.6 years, P = 0.05).

As summarized in Table 1, the malignant nodules were smaller than the benign nodules (19.8 ± 12.5 vs 31.8 ± 17.4 mm, P < 0.001) and overall had a solid pattern. Sonographically, malignant thyroid nodules more frequently showed marked hypoechogenicity (P < 0.001), irregular margins (P < 0.001), microcalcifications (P < 0.001), intranodular vascularization (P = 0.05) and elevated stiffness (P < 0.001) than the benign nodules.

Table 1 Ultrasonographic features of 146 thyroid nodules

Cytological and histological reports

In our series, 68 nodules (46.6%) were histologically proven to be malignant based on the resected specimens (54 papillary carcinomas, 10 follicular carcinomas, 1 medullary carcinoma, 1 NIFTP and 2 WDT-UMP). Benign lesions included 44 goiters, 1 thyroiditis and 33 follicular adenomas.

Regarding FNA cytological findings, 116 nodules (79.5%) were classified into high risk categories (Tir3B, 82 cases; Tir4, 9 cases; and Tir5, 25 cases), resulting in 65 malignant (56.0%) and 51 benign lesions (44.0%). The remaining 30 nodules were classified into low-risk cytological categories (Tir2, 25 cases, and Tir3A, 5 cases), 27 of them (90.0%) with a benign final histological report and 3 (10.0%) with a malignant one (Table 2).

Table 2 Stratification of cytologic FNA score based on surgery indication together with the comparison between cytologic FNA score and histology report

The risk of malignancy (ROM) according to the different cytological classes was 0% for Tir2, 15.4% for Tir3A, 35.4% for Tir3B and 100% for Tir4 and Tir5. Using the chi-square test with Yates correction, the cytological reports were significantly associated with the histological diagnoses (χ² = 21.1769; P < 0.001).

FNA cytology classified macrofollicular goiters (90.9% in the TIR2 category) and classical papillary carcinomas (100% in the TIR3B, TIR4 and TIR5 categories) with high accuracy. On the other hand, microfollicular goiters (especially those with an oxyphilic component) and variants of papillary carcinomas, as well as follicular adenomas and carcinomas, caused major diagnostic problems since they were usually classified in the TIR3A or TIR3B category. In indeterminate lesions (TIR3), the EUTIRADS and ATA scores performed similarly to the cytological classification, defining most TIR3A nodules (69.2%) as low risk and the majority of TIR3B nodules (70.7%) as intermediate/high risk. The AACE/ACE/AME performance seemed to be influenced by the relatively small number of lesions classified as low risk; consequently, this system performed poorly in defining TIR3A lesions as low risk. Regarding discordant cases, FNA cytological evaluation was unable to identify malignant lesions in two cases classified as TIR3A. These lesions turned out to be a follicular variant of papillary carcinoma and a Hürthle cell carcinoma, both characterized by minimally aggressive features. The first was classified as carcinoma because of the presence of papillary carcinoma-type nuclei and psammoma bodies, while the latter had only a single focus of capsular invasion, corresponding to a diagnosis of minimally invasive Hürthle cell carcinoma.

When considering TIR3B nodules, the likelihood of a histologically malignant result in low ultrasound-risk categories (6/23, 26.1%) was similar to that observed in intermediate-risk lesions by the three US-scoring systems (EUTIRADS: 9/36, 25.0%; ATA: 11/42, 26.2%, AACE/ACE/AME: 17/60, 28.3%).

Ultrasound risk stratification systems (US-RSSs)

The percentage of nodules that would have undergone FNA biopsy was 84.25% according to the ATA, 69.86% according to the EUTIRADS, and 64.38% according to the AACE/ACE/AME scoring system. The frequencies and malignancy risks of all categories of the three guidelines are reported in Table 3.

Table 3 Risk of malignancy of the three ultrasound risk stratification guidelines

In relation to the histological results, the ATA score was more sensitive, as it missed fewer carcinomas than the other systems (18 with ATA vs 22 with EUTIRADS and 26 with AACE/ACE/AME). Conversely, the EUTIRADS achieved a higher specificity and would have prompted fewer biopsies (ATA score 84.25% versus EUTIRADS score 69.86%) while maintaining almost the same efficacy as the ATA score (Table 4).
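As a worked example, the 2×2 table of each system can be approximately reconstructed from figures quoted in the text: the biopsy rate, the number of missed carcinomas, and the totals of 146 nodules and 68 malignant lesions. The sketch below assumes that the missed carcinomas are the false negatives of the biopsy indication and that the biopsy percentages refer to all 146 nodules; it is not the authors' Stata analysis, the variable and function names are hypothetical, and the results may differ slightly from Table 4.

```python
N_TOTAL = 146       # nodules in the series
N_MALIGNANT = 68    # histologically malignant nodules

# Biopsy rates and missed carcinomas as quoted in the text.
SYSTEMS = {
    "ATA": {"fna_rate": 0.8425, "missed": 18},
    "EUTIRADS": {"fna_rate": 0.6986, "missed": 22},
    "AACE/ACE/AME": {"fna_rate": 0.6438, "missed": 26},
}


def performance(fna_rate, missed):
    """Rebuild the 2x2 table and derive the usual diagnostic metrics."""
    biopsied = round(fna_rate * N_TOTAL)   # nodules with FNA indicated
    fn = missed                            # malignant, FNA not indicated
    tp = N_MALIGNANT - fn                  # malignant, FNA indicated
    fp = biopsied - tp                     # benign, FNA indicated
    tn = (N_TOTAL - N_MALIGNANT) - fp      # benign, FNA not indicated
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / N_TOTAL,
    }


RESULTS = {name: performance(**spec) for name, spec in SYSTEMS.items()}

for name, r in RESULTS.items():
    print(f"{name}: sensitivity {r['sensitivity']:.1%}, "
          f"specificity {r['specificity']:.1%}, accuracy {r['accuracy']:.1%}")
```

Under these assumptions, the reconstructed accuracies round to 38% for the ATA and 47% for the EUTIRADS, in line with the values quoted in the Discussion.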

Table 4 Diagnostic performances of three ultrasound scoring systems compared with cytological and histological findings

To evaluate the agreement between the results of the three scoring systems, we calculated Cohen's kappa coefficient for the three systems, comparing two US-RSSs at a time. All pairs of US-RSSs showed good or moderate agreement. However, AACE/ACE/AME and EUTIRADS showed the highest concordance (κ = 0.75, P < 0.001).

We then evaluated the ability of the three US-RSSs to discriminate between nodules at low suspicion (TIR2, TIR3A) and high suspicion (TIR3B, TIR4 and TIR5) for malignancy in relation to histology. Low-risk nodules proved histologically benign in most cases according to both the ATA and EUTIRADS systems (78.8%), and in all cases (100%) according to the AACE/ACE/AME system. However, the latter system classified only 6/146 lesions as low risk, which likely explains the high accuracy described. Remarkably, all lesions classified as very low suspicion by the ATA, as low risk by the AACE/ACE/AME, or as class 2 by the EUTIRADS systems were in fact histologically benign. The low-suspicion category of the ATA and class 3 of the EUTIRADS contained a few false-negative cases, including 6 papillary thyroid cancers (all classified as at low risk of recurrence according to the 2015 ATA system), 2 follicular thyroid cancers (one a minimally invasive FTC with capsular invasion only, the other completely cystic), 1 NIFTP and 2 WDT-UMP at surgery.

When nodules belonging to the US-RSS high-suspicion categories were analyzed, the ROM was 94.9% by the ATA, 87.0% by the EUTIRADS and 86.0% by the AACE/ACE/AME systems. In this category, the ATA misclassified 2 histologically benign lesions (1 goiter, 1 follicular adenoma) out of 39 high-suspicion nodules, while the AACE/ACE/AME and EUTIRADS each misclassified 6 benign lesions (3 goiters and 3 follicular adenomas) out of 43 and 46 high-suspicion nodules, respectively.

Our analysis of nodules with an intermediate risk showed that these were histologically proven to be benign in 35/55 cases (63.6%) according to the ATA, 66/97 (68.0%) according to the AACE/ACE/AME, and 31/48 (64.6%) according to the EUTIRADS systems. When a size threshold for biopsy of >20 mm instead of 10 mm was modeled for this category in the ATA guideline, the unnecessary biopsy rate decreased from 94.3% to 57.0%, and specificity increased from 5 to 43%. For the EUTIRADS and AACE/ACE/AME guidelines, applying the dimensional criterion of >20 mm would yield a specificity of 39 and 33%, respectively, with a sensitivity of 53 and 58%.
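The category-level risks of malignancy quoted above can be recomputed from the counts given in the text. In the sketch below, the intermediate and high category counts come from the fractions reported in this section, whereas the low-risk totals for the ATA and EUTIRADS (52 nodules each) are inferred by subtraction from 146 and should be read as an assumption rather than as data taken from Table 3. All names are hypothetical.

```python
# (malignant, total) nodules per common risk class, derived from the text.
COUNTS = {
    "ATA": {"low": (11, 52), "intermediate": (20, 55), "high": (37, 39)},
    "EUTIRADS": {"low": (11, 52), "intermediate": (17, 48), "high": (40, 46)},
    "AACE/ACE/AME": {"low": (0, 6), "intermediate": (31, 97), "high": (37, 43)},
}


def rom(system, risk_class):
    """Risk of malignancy (%) = malignant nodules / all nodules in the class."""
    malignant, total = COUNTS[system][risk_class]
    return 100.0 * malignant / total


for system, classes in COUNTS.items():
    summary = ", ".join(f"{c}: {rom(system, c):.1f}%" for c in classes)
    print(f"{system} -> {summary}")
```

A simple consistency check is that, for each system, the class totals add up to 146 nodules and the malignant counts add up to 68.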

In the present series, a very small proportion of cases (1 NIFTP, 2 WDT-UMP) had ultrasound features consistent with a low malignancy suspicion (predominantly solid, isoechoic pattern, regular margins and no micro- or macrocalcifications) and were therefore included in the following categories: low-risk by ATA, intermediate by AACE/ACE/AME, and class 3 by EUTIRADS systems. In all cases, FNA was performed because of the dimensional criterion (>20 mm), and the cytological results were TIR3B for the histologically proven NIFTP and TIR2 and TIR3B for the two WDT-UMP tumors.

The ATA, EUTIRADS and AACE/ACE/AME sonographic stratification systems failed to identify 18, 22, and 26 histologically malignant nodules, respectively. As shown in Table 5, the diagnostic performance and false-negative rates of the three systems appeared to be influenced by the nodule size cutoff selected for biopsy recommendation.

Table 5 Characterization of histologically malignant nodules not indicated for FNAB according to three guidelines

Discussion

Our study compared the US-scoring systems, combined with cytological characterization and post-surgical histological diagnoses in a relatively large series of consecutive thyroid nodules undergoing surgery.

Although good agreement was observed among the results of the three systems, our findings showed that each US-RSS has its own peculiarities. The ATA scoring system had higher sensitivity than the AACE/ACE/AME and EUTIRADS classifications, while its specificity was lower. The ATA score could perform better because it requires a second parameter beyond the hypoechoic pattern, which reduces the overestimation seen with the other two US-RSSs. Our recent study demonstrated the applicability of a score (U score) that requires the presence of at least two US features of suspected malignancy to proceed to FNA, regardless of the specific predictive value of each US feature [19]. When the accuracy of these scoring systems was assessed against the subsequent histological outcome, the ATA scoring system showed an accuracy of 38% in differentiating malignant from benign conditions, whereas the EUTIRADS classification retained a better accuracy of 47%, with a 29% specificity and 68% sensitivity. These EUTIRADS values are similar to those obtained by the AACE/ACE/AME scoring system.

We showed that the categorization of low- and high-risk thyroid nodules using current US-RSSs, regardless of lesion size, helps to determine the optimal treatment option. On the other hand, for intermediate-suspicion categories, a thyroid nodule size cutoff of >20 mm for biopsy appeared to improve diagnostic accuracy according to all three guidelines. In fact, in our series the US-RSSs seemed to provide a reliable stratification of low- and high-risk lesions; specifically, when the risk score of a nodule was low, the overall risk of malignancy varied from 0% according to the AACE/ACE/AME system to 21.2% (including NIFTP and WDT-UMP) according to the ATA and EUTIRADS. For these categories, ultrasound follow-up should be recommended because of the expected indolent behavior. Moreover, in the high-risk category, the risk of malignancy was higher than 85% based on US features alone. Nodule size might affect the subsequent diagnostic and therapeutic decision. Surgical resection can be considered for nodules >10 mm. In the case of subcentimeter nodules with high-risk US features, although the most recent ATA guidelines do not recommend diagnosing low-risk papillary microcarcinoma based on cytology [2], recent data showed that it is better to cytologically diagnose suspicious nodules as papillary microcarcinoma and clearly disclose a diagnosis of carcinoma to patients [20, 21]. We also think that FNA should be performed more often. Without the diagnosis of microcarcinoma, it can be very difficult to persuade patients to undergo regular checkups. After the diagnosis of microcarcinoma without any high-risk features (such as pathological lymph nodes), we offered two management options, active surveillance and surgery, and asked patients which option they would prefer. However, we recommend active surveillance as the first-line management because of the favorable data regarding this approach. These results prompted us to consider US parameters only, regardless of lesion size, in the ultrasonographic risk stratification in order to identify nodules that require FNA.

The accuracy remains modest for lesions belonging to the intermediate-risk class. Nodules scored as intermediate risk are challenging, and we showed that this is especially true for the AACE/ACE/AME system. This uncertainty possibly results from the inclusion in this category of follicular carcinomas (normally isoechoic or hypoechoic without other ultrasonographic criteria). In our study, about 63% of the intermediate-risk nodules were diagnosed as benign at surgery. Intuitively, these patients should not be immediate candidates for surgery [22], and the dimensional cutoff [23] may only help to discriminate the need for surgery in patients with potentially benign lesions. In such cases, clinical decisions on surgery can be made after considering the coexistent clinical factors and the patient's preferences. Nevertheless, the risk of overdiagnosis of benign nodules and the delayed diagnosis of a follicular thyroid carcinoma remain to be considered.

Moreover, the proposed scoring parameters appeared to be more accurate for papillary cancers than for other histological types. In fact, the unsuspicious ultrasonographic presentation of FTC, as well as the recognized limitations of cytological assessment in detecting it, represents an important problem in intermediate-risk classes; unfortunately, it is known that FTC can be underestimated and accounts for the majority of the false negatives of all stratification systems [24]. However, in our retrospective series, follicular carcinomas were also hidden among low-risk ATA lesions or EUTIRADS class 3 lesions. These nodules underwent surgery because of their large size (mean size 29.8 mm versus 19.73 mm for all carcinomas).

Our series showed a risk of malignancy in the cytological categories comparable with that reported in the 2017 Bethesda System for Reporting Thyroid Cytopathology and in the Italian consensus for the classification and reporting of thyroid cytology [18, 25], although it was slightly higher in indeterminate and high-risk lesions (TIR3 and higher). This discrepancy could be explained by the selection bias of our series, which only comprised surgically treated patients. A good concordance between cytological and histological reports was observed [18], whether or not NIFTP and WDT-UMP were considered malignant lesions. These are two relatively new entities that share both benign and malignant cytomorphological features but are generally associated with indolent behavior and an excellent prognosis, requiring no further treatment after surgery [26,27,28]. The performance of these systems in borderline conditions (WDT-UMP and NIFTP) is unknown. In our hands, these lesions showed mildly suspicious US features, but further cases need to be investigated to validate the data.

Focusing on indeterminate lesions, the risk of malignancy for TIR3A was 15.4%, while that for TIR3B was 37.8% if NIFTP was considered malignant or 35.4% if NIFTP was considered benign. The percentage of NIFTP reported in our series (2.4%) is lower than expected (mean 10–15%) [29,30,31], and we are inclined to speculate that this low figure may reflect the fact that we included only NIFTPs arising from clinically suspected nodules, while those incidentally found in histological specimens were not included. In all, this evidence suggests a good predictive ability of cytological classifications in defining the risk of malignancy for thyroid nodules.

With regard to nodules classified as TIR3B, our data fit perfectly into the 15–30% probability of malignancy reported by SIAPEC [18]. Moreover, the risk of malignancy increased in the high-risk US categories (87.5%, 72.7%, and 70.0% for the ATA, EUTIRADS, and AACE/ACE/AME systems, respectively). In summary, the use of US-RSS in combination with cytological results is expected to enable personalized treatment (US surveillance, repeat biopsy, diagnostic surgery) for cases of thyroid nodules with TIR3.

Our study has several limitations that must be considered when interpreting our findings. These include its retrospective nature, the analysis of US features on static US images, and the possibility of selection bias because only patients who underwent surgical resection were included. Therefore, the malignancy rate of the nodules in our study is higher than that of the general population.

In conclusion, clinicians should be aware of the strengths and weaknesses of each of the ultrasound score systems in the management of thyroid nodules [32]. The diagnostic performance and unnecessary biopsy rates of various guidelines are influenced by nodule size cutoff for biopsy. Regardless of the scoring system that the clinician decides to use, the results of our study could help to simplify the decision-making process regarding the management of low-risk nodules. In addition, in high risk categories, the US-RSS performs better when evaluating at least two ultrasound risk features. Finally, the combined use of ultrasound features, a dimensional criterion and the cytological results should allow for personalized treatment (US surveillance, biopsy repetition, diagnostic surgery) for cases of intermediate-risk thyroid nodules.

Further validation of the malignancy risk of each category is required in a larger, prospective, longitudinal study that evaluates the outcomes and costs regarding adjustment in size thresholds in potential revisions of US-based risk stratification systems.