Introduction

Ultrasonography (US) is an essential diagnostic tool for assessing thyroid nodule characteristics and selecting nodules for fine-needle aspiration biopsy (FNAB) [1]. Specific US characteristics, such as microcalcifications, hypoechogenicity, irregular margins, taller-than-wide shape, and intranodular vascularity, have been proposed to be associated with thyroid malignancy. However, none of them alone reliably distinguishes malignant from benign nodules [2]. Thus, efforts have been made to develop US-based malignancy risk stratification systems for thyroid nodules. In 2009, Horvath was the first to propose a Thyroid Imaging Reporting and Data System (TIRADS) based on ten US patterns, taking the Breast Imaging Reporting and Data System (BIRADS) as a model [3]. However, these sonographic patterns are not applicable to all types of thyroid nodules [4]. Recently, the Korean Society of Thyroid Radiology (KSThR), the American College of Radiology (ACR), and the European Thyroid Association (ETA) have published three editions of TIRADS in succession [5,6,7]. In 2016, the KSThR revised its recommendations for the US diagnosis and imaging-based management of thyroid nodules on the basis of the original edition published in 2011, emphasizing risk stratification based on the solidity and echogenicity of thyroid nodules [5]. Then in 2017, the ACR developed ACR-TIRADS, in which points are assigned to each US feature and the total determines the nodule’s ACR TI-RADS level, which ranges from TR1 (benign) to TR5 (high suspicion of malignancy) [6]. In the same year, the ETA created the novel European Thyroid Imaging and Reporting Data System (EU-TIRADS), consisting of five categories based on different patterns and US features [7]. However, although these three newly released international US risk stratification systems represent thyroid associations worldwide, their diagnostic value has not been well validated.
The present study aimed to evaluate the three newly published TIRADS from the KSThR, ETA, and ACR using our large-sample database and to compare their diagnostic efficiency for better application in clinical practice.

Materials and methods

Subjects

From January 2014 to October 2017, a consecutive series of 3210 lesions that underwent thyroid US examination and FNA and/or surgery at three tertiary hospitals in Jiangsu Province was reviewed for inclusion in this study. Nodules that met all of the following criteria on review of the US patterns and clinical data were included: (a) nodules with definite histopathology results, (b) nodules with complete Bethesda system for reporting thyroid cytopathology (BSRTC) results. The exclusion criteria were: (a) nodules without postoperative pathology, except for those with BSRTC II cytology, and (b) nodules with BSRTC II cytology whose US follow-up interval was less than one year or that showed an increase in size or a change in US features during follow-up (Fig. 1). An increase in size was defined as more than a 50% change in volume or a 20% increase in at least two nodule dimensions with a minimal increase of 2 mm in solid nodules or in the solid portion of mixed cystic-solid nodules [8]. Finally, 2031 patients with 2465 thyroid nodules were enrolled in this study, including 415 male and 1616 female patients. The mean age of the patients was 47.70 ± 13.38 years. The study was performed in accordance with the ethical guidelines of the Helsinki Declaration and was approved by the local ethics review committee (2012-SR-058).
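
As an illustration only, the size-increase definition above can be written as a short Python check. The function name and argument layout are hypothetical and were not part of the study; the volume criterion is read here as an increase of more than 50%, and volumes are passed in directly rather than computed from the dimensions.

```python
def is_significant_growth(dims_before_mm, dims_after_mm,
                          volume_before=None, volume_after=None):
    """Return True if a nodule meets the growth definition quoted in the text.

    dims_*_mm: nodule dimensions in mm, in the same order at both time points.
    volume_*:  optional volumes in any consistent unit.
    """
    # Criterion 1 (assumed to mean an increase): volume grew by more than 50%.
    if volume_before is not None and volume_after is not None and volume_before > 0:
        if (volume_after - volume_before) / volume_before > 0.5:
            return True

    # Criterion 2: at least two dimensions grew by >= 20% and by >= 2 mm.
    grown = 0
    for before, after in zip(dims_before_mm, dims_after_mm):
        increase = after - before
        if before > 0 and increase >= 0.2 * before and increase >= 2.0:
            grown += 1
    return grown >= 2
```

For example, a nodule growing from 10 × 8 × 6 mm to 12 × 10 × 6 mm satisfies the second criterion, whereas growth to 11 × 9 × 6 mm does not.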

Fig. 1 Diagram of the study group

US examination technique

All US images were obtained using a 4–13 MHz linear array transducer. The scanning protocol in all cases included both transverse and longitudinal real-time imaging of the thyroid nodules. Designated radiologists from the three centers assessed the thyroid nodules using a single set of standards based on the published literature [9]. The features analyzed included size, composition (cystic/mixed cystic and solid/solid), echogenicity of the solid portion (anechoic/isoechoic/hyperechoic/hypoechoic), echotexture (homogeneous/heterogeneous), vascularity (Type I/II/III), shape (wider than tall/taller than wide), margin (well-defined/ill-defined/irregular) and calcification (absent/comet-tail artifacts/macro/eggshell/micro). One specialist from each center extracted the US features from static US images and the accompanying feature descriptions and entered them into a database. Finally, one radiologist experienced in thyroid imaging performed all classifications from the database. The radiologist was blinded to the clinical information and pathology results.

KSThR-TIRADS classification

All nodules were scored based on the patterns and US features of KSThR-TIRADS as follows (Fig. 2) [5]. Category 2: Spongiform; Partially cystic nodule with comet-tail artifact; Pure cyst. Category 3: Partially cystic or iso/hyperechoic nodule without any of the three suspicious US features (microcalcification, nonparallel orientation, spiculated/microlobulated margin). Category 4: Solid hypoechoic nodule without any of the three suspicious US features; Partially cystic or iso/hyperechoic nodule with any of the three suspicious US features. Category 5: Solid hypoechoic nodule with any of the three suspicious US features.
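
The category rules above can be sketched as a small Python function. This is a minimal illustration, not the software used in the study: the function name, argument names and string labels are hypothetical, and the order in which the rules are checked (category 2 patterns first) is an assumption, since the text does not say how overlapping patterns were resolved.

```python
def ksthr_category(composition, echogenicity, n_suspicious,
                   spongiform=False, comet_tail=False):
    """Assign a KSThR-TIRADS category (2 to 5) from the rules in the text.

    composition:  'cystic', 'partially_cystic' or 'solid'
    echogenicity: 'hypoechoic', 'isoechoic' or 'hyperechoic' (solid portion)
    n_suspicious: how many of the three suspicious features are present
                  (microcalcification, nonparallel orientation,
                  spiculated/microlobulated margin)
    """
    # Category 2: spongiform, pure cyst, or partially cystic with comet-tail artifact.
    # Checking these first is an assumption of this sketch.
    if spongiform or composition == "cystic":
        return 2
    if composition == "partially_cystic" and comet_tail:
        return 2

    # Solid hypoechoic nodules: category 5 with any suspicious feature, else 4.
    if composition == "solid" and echogenicity == "hypoechoic":
        return 5 if n_suspicious > 0 else 4

    # Remaining nodules are partially cystic or iso/hyperechoic:
    # category 4 with any suspicious feature, else 3.
    return 4 if n_suspicious > 0 else 3
```

For instance, a solid hyperechoic nodule with one suspicious feature falls into category 4, while a solid hypoechoic nodule with the same feature falls into category 5.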

Fig. 2 Patterns categorized based on ACR-TIRADS, EU-TIRADS and KSThR-TIRADS: a TR1/EU-TIRADS 2/KSThR-TIRADS 2: almost completely cystic (nodular goiter with cystic degeneration). b TR2/EU-TIRADS 3/KSThR-TIRADS 3: mixed cystic and solid isoechoic without any suspicious feature (nodular goiter). c TR3/EU-TIRADS 3/KSThR-TIRADS 3: almost completely solid isoechoic without any suspicious feature (follicular adenoma). d TR4/EU-TIRADS 4/KSThR-TIRADS 4: solid hypoechoic without any suspicious feature (nodular goiter). e TR4/EU-TIRADS 5/KSThR-TIRADS 4: solid hyperechoic with taller-than-wide shape (papillary thyroid cancer). f TR5/EU-TIRADS 5/KSThR-TIRADS 5: solid hypoechoic with taller-than-wide shape (papillary thyroid cancer)

ACR-TIRADS classification

All nodules were scored based on ACR-TIRADS as follows (Fig. 2) [6]: Composition (0 points: Cystic or almost completely cystic/Spongiform; 1 point: Mixed cystic and solid; 2 points: Solid or almost completely solid); Echogenicity (0 points: Anechoic; 1 point: Hyperechoic or isoechoic; 2 points: Hypoechoic; 3 points: Very hypoechoic); Shape (0 points: Wider-than-tall; 3 points: Taller-than-wide); Margin (0 points: Smooth/ill-defined; 2 points: Lobulated or irregular; 3 points: Extra-thyroidal extension); Echogenic foci (0 points: None or large comet-tail artifacts; 1 point: Macrocalcification; 2 points: Peripheral calcifications; 3 points: Punctate echogenic foci). The total points were calculated to determine the TI-RADS level: TR1 (0 points), TR2 (2 points), TR3 (3 points), TR4 (4 to 6 points), TR5 (7 points or more).
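
The point system above lends itself to a lookup-table sketch in Python. The dictionary keys and function names below are hypothetical labels chosen for illustration, and a single echogenic-foci entry is scored per nodule, as in the list above. The text gives no level for a total of 1 point; treating it as TR1 is an assumption of this sketch.

```python
# Points per feature, transcribed from the scoring list in the text.
COMPOSITION = {"cystic": 0, "spongiform": 0, "mixed": 1, "solid": 2}
ECHOGENICITY = {"anechoic": 0, "hyperechoic": 1, "isoechoic": 1,
                "hypoechoic": 2, "very_hypoechoic": 3}
SHAPE = {"wider_than_tall": 0, "taller_than_wide": 3}
MARGIN = {"smooth": 0, "ill_defined": 0, "lobulated": 2, "irregular": 2,
          "extrathyroidal_extension": 3}
FOCI = {"none": 0, "large_comet_tail": 0, "macrocalcification": 1,
        "peripheral": 2, "punctate": 3}


def acr_points(composition, echogenicity, shape, margin, foci="none"):
    """Sum the points of the five ACR-TIRADS feature groups."""
    return (COMPOSITION[composition] + ECHOGENICITY[echogenicity]
            + SHAPE[shape] + MARGIN[margin] + FOCI[foci])


def acr_level(points):
    """Map a point total to a TI-RADS level."""
    if points <= 1:
        return "TR1"  # 0 points per the text; 1 point is an assumption
    if points == 2:
        return "TR2"
    if points == 3:
        return "TR3"
    if points <= 6:
        return "TR4"
    return "TR5"
```

As a usage example, a solid (2) hyperechoic (1) nodule with a macrocalcification (1) totals 4 points and is therefore TR4, while a solid (2) hypoechoic (2) taller-than-wide (3) nodule totals 7 points and is TR5.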

EU-TIRADS classification

All nodules were scored based on the patterns and US features of EU-TIRADS as follows (Fig. 2) [7]. EU-TIRADS 2: Pure cyst; Entirely spongiform. EU-TIRADS 3: Ovoid, smooth, isoechoic/hyperechoic and no features of high suspicion. EU-TIRADS 4: Ovoid, smooth, mildly hypoechoic and no features of high suspicion. EU-TIRADS 5: At least one of the following features of high suspicion: irregular shape, irregular margins, microcalcifications, or marked hypoechogenicity (and solid).
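
A minimal Python sketch of these pattern rules follows. The function and argument names are hypothetical, and the caller is assumed to have already counted the high-suspicion features (including marked hypoechogenicity) rather than passing raw image findings.

```python
def eu_tirads_category(echogenicity, n_high_suspicion=0,
                       pure_cyst=False, spongiform=False):
    """Assign an EU-TIRADS category (2 to 5) from the rules in the text.

    echogenicity:     'isoechoic', 'hyperechoic' or 'mildly_hypoechoic'
    n_high_suspicion: number of high-suspicion features present (irregular
                      shape, irregular margins, microcalcifications,
                      marked hypoechogenicity)
    """
    # EU-TIRADS 2: pure cyst or entirely spongiform nodule.
    if pure_cyst or spongiform:
        return 2
    # EU-TIRADS 5: at least one high-suspicion feature, whatever the echogenicity.
    if n_high_suspicion > 0:
        return 5
    # EU-TIRADS 3 and 4 differ only in echogenicity.
    if echogenicity in ("isoechoic", "hyperechoic"):
        return 3
    if echogenicity == "mildly_hypoechoic":
        return 4
    raise ValueError(f"unhandled echogenicity: {echogenicity!r}")
```

Because any single high-suspicion feature is sufficient for category 5, a hyperechoic nodule with a taller-than-wide shape is placed in EU-TIRADS 5 under this sketch.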

Statistical analysis

Statistical analysis was performed using SPSS 20.0 software (SPSS Inc., Chicago, USA). All quantitative values were expressed as means ± SD. Differences in continuous variables were evaluated with a non-parametric test. Differences in the distribution of categorical variables between groups were evaluated with the two-tailed Chi-square (χ2) test or Fisher exact test. According to the final diagnosis (pathology or follow-up results), the sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were calculated for each method. Receiver operating characteristic (ROC) curve analysis and the area under the curve (AUC), computed with MedCalc 11.4.2.0 software (MedCalc Software, Ostend, Belgium), were used to compare the diagnostic value of the three models and to determine the optimal cut-off value between benign and malignant nodules. A paired Chi-square test was used to compare sensitivity and specificity between each pair of models. AUCs and P values were calculated. P < 0.05 was considered significant in all tests.
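
For readers who wish to recompute the four diagnostic indices, they follow directly from a 2 × 2 table of test result against final diagnosis. The Python sketch below is illustrative only (the study used SPSS and MedCalc); the function name is hypothetical and the counts in the usage note are made-up numbers, not study data.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from 2 x 2 counts.

    tp: malignant nodules called positive   fn: malignant nodules called negative
    fp: benign nodules called positive      tn: benign nodules called negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of malignant nodules detected
        "specificity": tn / (tn + fp),  # share of benign nodules correctly cleared
        "ppv": tp / (tp + fp),          # share of positive calls that are malignant
        "npv": tn / (tn + fn),          # share of negative calls that are benign
    }
```

With hypothetical counts tp = 90, fp = 20, fn = 10 and tn = 80, the function returns a sensitivity of 0.90 and a specificity of 0.80. PPV and NPV depend on the proportion of malignant nodules in the sample, whereas sensitivity and specificity do not.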

Results

Clinical and US profile

A total of 2465 thyroid nodules in 2031 patients were included in our study. The mean diameter of the nodules was 16.63 ± 11.78 mm. Among the nodules, 505 benign nodules and all 1005 malignant lesions were confirmed by histopathology. The remaining 955 benign lesions were diagnosed on the basis of benign cytology and follow-up ultrasound. The epidemiological and clinical data of the malignant and benign groups are shown in Table 1. Patients in the benign group were older on average than those in the malignant group (P < 0.05). In addition, malignant lesions tended to be solitary and smaller (P < 0.05 for both). US features of benign and malignant nodules are shown in Table 2. As determined by χ2 tests, the differences between benign and malignant nodules in echogenicity, composition, echotexture, margin, shape, vascularity, and calcification were statistically significant (P < 0.05 for all).

Table 1 Clinical features of the study population and basic characters of the nodules
Table 2 Ultrasound features of the nodules according to final diagnosis

Correlations between TIRADS classification and final diagnosis

All 2465 nodules could be classified with KSThR-TIRADS, EU-TIRADS, and ACR-TIRADS. For KSThR-TIRADS, based on histopathology or follow-up results, the malignancy rates of nodules in categories 2 to 5 were 2.8% (4 of 141 nodules), 5.1% (35 of 687 nodules), 33.7% (248 of 735 nodules) and 79.6% (718 of 902 nodules), respectively, with significant differences among categories (P < 0.001) (Table 3).

Table 3 The malignancy rates of KSThR-TIRADS, ACR-TIRADS, and EU-TIRADS according to final diagnosis

For EU-TIRADS, the risk of malignancy rose significantly as the TIRADS category increased. The risk of malignancy was 0% (0 of 62 nodules) in category 2, 3.1% (19 of 607 nodules) in category 3, 22.8% (150 of 659 nodules) in category 4 and 73.5% (836 of 1137 nodules) in category 5, with significant differences among categories (P < 0.001) (Table 3).

According to histopathology or follow-up results, the malignancy rates of ACR-TIRADS categories TR1 to TR5 were 0% (0 of 62 nodules), 2.3% (10 of 426 nodules), 7.5% (24 of 319 nodules), 40.1% (368 of 917 nodules), and 81.4% (603 of 741 nodules), respectively (Table 3).

Comparison of the three TIRADS editions in diagnostic value

The ROC curves of the three TIRADS editions are shown in Fig. 3. As shown in Table 4, among the three editions of TIRADS, KSThR-TIRADS had the highest AUC (0.855) and specificity (87.4%) but the lowest sensitivity (71.4%). ACR-TIRADS showed the best sensitivity (96.6%) with the lowest specificity (52.9%), and its AUC (0.846) was slightly lower than that of KSThR-TIRADS. Compared with ACR-TIRADS, EU-TIRADS had a slightly lower AUC (0.843), while its specificity was significantly higher (79.4%). The statistical differences in AUC, sensitivity and specificity between each pair of US risk stratification models are shown in Table 5. The differences in sensitivity and specificity between each pair of models were significant (P < 0.05). In terms of AUC, the difference between KSThR-TIRADS and EU-TIRADS was statistically significant.
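
The sensitivities and specificities quoted above can be cross-checked against the per-category counts reported in the preceding section. The Python sketch below does this under one assumption that is not stated in the text: the cut-off for a positive result is taken as category 5 for KSThR-TIRADS and EU-TIRADS and as TR4 for ACR-TIRADS, because these thresholds reproduce the reported figures. All names in the code are hypothetical.

```python
# (nodules, malignant nodules) per category, copied from the Results text.
COUNTS = {
    "KSThR-TIRADS": {2: (141, 4), 3: (687, 35), 4: (735, 248), 5: (902, 718)},
    "EU-TIRADS": {2: (62, 0), 3: (607, 19), 4: (659, 150), 5: (1137, 836)},
    "ACR-TIRADS": {1: (62, 0), 2: (426, 10), 3: (319, 24),
                   4: (917, 368), 5: (741, 603)},
}

# Lowest category counted as test-positive. ASSUMPTION: inferred, not reported.
CUTOFF = {"KSThR-TIRADS": 5, "EU-TIRADS": 5, "ACR-TIRADS": 4}


def sens_spec(counts, cutoff):
    """Sensitivity and specificity when categories >= cutoff are called positive."""
    tp = sum(mal for cat, (n, mal) in counts.items() if cat >= cutoff)
    fp = sum(n - mal for cat, (n, mal) in counts.items() if cat >= cutoff)
    malignant = sum(mal for n, mal in counts.values())
    benign = sum(n - mal for n, mal in counts.values())
    return tp / malignant, (benign - fp) / benign


for name, counts in COUNTS.items():
    se, sp = sens_spec(counts, CUTOFF[name])
    print(f"{name}: sensitivity {se:.1%}, specificity {sp:.1%}")
```

Under these cut-offs the sketch gives 71.4%/87.4% for KSThR-TIRADS and 96.6%/52.9% for ACR-TIRADS, and a specificity of 79.4% for EU-TIRADS, matching the values in the text. The corresponding EU-TIRADS sensitivity of 83.2% is derived here rather than quoted from the text.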

Fig. 3 ROC curves of KSThR-TIRADS, EU-TIRADS, and ACR-TIRADS based on final diagnosis

Table 4 The comparison of three editions of TIRADS in diagnostic value
Table 5 Statistical results of three TIRADS editions in diagnostic value between groups

The comparison of three TIRADS editions in FNAB criteria

Among the 2465 thyroid nodules, a total of 1383 (56.1%), 1120 (45.4%) and 921 (37.4%) FNABs were recommended according to the FNAB criteria of the KSThR, ETA and ACR, revealing 592 (42.8%), 498 (44.5%) and 494 (53.6%) malignant lesions, respectively (ACR vs ETA, ACR vs KSThR, P < 0.001; ETA vs KSThR, P > 0.05) (Table 6). Conversely, a total of 413 (38.2%), 507 (37.7%) and 511 (33.1%) thyroid cancers, respectively, would be missed among the nodules not recommended for FNAB (non-FNABs) (ACR vs ETA, ACR vs KSThR, P < 0.001; ETA vs KSThR, P > 0.05). The rate of unnecessary FNAB was lowest with the ACR guidelines (17.3%), followed by EU-TIRADS (25.2%) and KSThR-TIRADS (32.1%) (P < 0.001 for all). The false-positive rate was lowest with the ACR guidelines (29.2%), followed by EU-TIRADS (42.6%) and KSThR-TIRADS (54.2%) (P < 0.001 for all). The false-negative rate was highest with the ACR guidelines (50.8%), followed by EU-TIRADS (50.4%) and KSThR-TIRADS (41.1%) (ACR vs KSThR, ETA vs KSThR, P < 0.001; ETA vs ACR, P > 0.05).
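
The percentages in this paragraph can be recomputed from the raw counts. The Python sketch below is a cross-check, not the study's analysis code. The denominators are not defined explicitly in the text; those used here (all nodules for the unnecessary-FNAB rate, all benign nodules for the false-positive rate, all malignant nodules for the false-negative rate) are inferred because they reproduce the reported values. The function and variable names are hypothetical.

```python
TOTAL = 2465               # all nodules
MALIGNANT = 1005           # malignant nodules by final diagnosis
BENIGN = TOTAL - MALIGNANT

# (FNABs recommended, malignant nodules among them), from the text.
FNAB = {"KSThR": (1383, 592), "ETA": (1120, 498), "ACR": (921, 494)}


def fnab_rates(recommended, malignant_found):
    """FNAB performance rates; denominators are inferred, see the lead-in."""
    benign_biopsied = recommended - malignant_found
    missed = MALIGNANT - malignant_found
    return {
        "malignancy_in_fnab": malignant_found / recommended,
        "missed_in_non_fnab": missed / (TOTAL - recommended),
        "unnecessary_fnab": benign_biopsied / TOTAL,
        "false_positive": benign_biopsied / BENIGN,
        "false_negative": missed / MALIGNANT,
    }


for system, (recommended, found) in FNAB.items():
    rates = fnab_rates(recommended, found)
    print(system, {key: f"{value:.1%}" for key, value in rates.items()})
```

For ACR-TIRADS, for example, 921 − 494 = 427 benign nodules would be biopsied, giving an unnecessary-FNAB rate of 427/2465 = 17.3% and a false-positive rate of 427/1460 = 29.2%.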

Table 6 Diagnostic performance of three TIRADS editions in FNAB criteria

Discussion

The present validation study showed that the newly released editions of TIRADS from the KSThR, ETA, and ACR all have considerable value in predicting thyroid malignancy. Among them, KSThR-TIRADS performed best in differentiating malignant from benign nodules, while ACR-TIRADS and EU-TIRADS each showed advantages in diagnostic sensitivity or specificity. As for the FNAB criteria, ACR-TIRADS showed the lowest rate of unnecessary FNAB and the highest rate of malignancy among recommended FNABs.

In 2009, Horvath was the first to develop a TIRADS for cancer risk determination, just as BIRADS did for breast lesions [3]. Then, Park et al. [10] proposed an equation for predicting the probability of malignancy based on 12 ultrasound features; however, its complexity restricted its use in clinical practice. In 2011, Kwak et al. [11] published a TIRADS classification based on five suspicious ultrasound features. However, weighting all features equally neglected their different contributions to the risk of malignancy. In 2016, the American Thyroid Association (ATA) and subsequently the American Association of Clinical Endocrinologists (AACE), American College of Endocrinology (ACE), and Associazione Medici Endocrinologi (AME) constructed new ultrasound risk stratification models according to sonographic patterns [12, 13]. Our previous study revealed that thyroid nodule size influenced the diagnostic performance of Kwak-TIRADS and the ATA ultrasound patterns [14].

Recently, the KSThR, ACR, and ETA have published three editions of TIRADS in succession to help determine the nature of nodules and facilitate the selection of nodules for FNA cytological analysis. In KSThR-TIRADS, the benign category includes pure cystic nodules, spongiform nodules, and partially cystic nodules only when accompanied by comet-tail artifacts, patterns that were confirmed as benign in other studies [15, 16]. In our study, the malignancy risk of category 4, which comprises partially cystic or iso/hyperechoic nodules with suspicious US features and solid hypoechoic nodules without suspicious features, reached 33.7%. In Yoon’s study, the malignancy risk of hyper- to isoechoic solid or partially cystic nodules with suspicious features was 18.2% and that of solid hypoechoic nodules with smooth regular margins was 16.7%, both within the recommended risk range of category 4 [17]. In addition, 79.6% of nodules classified as category 5 were malignant in our study, indicating that a solid hypoechoic nodule with any of the three suspicious features is highly suspicious for malignancy.

In 2017, the ACR presented a white paper on ACR-TIRADS, which ranges from TR1 (benign) to TR5 (high suspicion of malignancy) [6]. Unlike KSThR-TIRADS, ACR-TIRADS is based on the total score of five US features: composition, echogenicity, shape, margin, and echogenic foci. The malignancy rates in our study were 0, 2.3, 7.5, 40.1, and 81.4% from TR1 to TR5, which were similar to those in William’s recent multi-institutional analysis [18]. In ACR-TIRADS, features such as macrocalcification, rim calcifications and mixed cystic and solid composition are regarded as carrying some risk of malignancy. A solid hyperechoic nodule with macrocalcification would be categorized as moderately suspicious by ACR-TIRADS, whereas it would be classified as low suspicion by KSThR-TIRADS or EU-TIRADS. This may explain why ACR-TIRADS showed the highest sensitivity yet the lowest specificity among the three systems. The diagnostic AUC of ACR-TIRADS was 0.846, which was close to Ha’s results [19] but lower than that of KSThR-TIRADS.

In the same year, the ETA created the novel EU-TIRADS based on a review of the literature and on the AACE, ATA, and Korean guidelines [7]. Both EU-TIRADS and KSThR-TIRADS are based on US patterns, but EU-TIRADS puts less emphasis on solidity. For example, a predominantly solid, mildly hypoechoic nodule without suspicious features is categorized as low suspicion by KSThR-TIRADS but as intermediate risk by EU-TIRADS. In a retrospective multi-center study, the malignancy rate of this pattern was 17.2%, which was close to the recommended risk of low suspicion by KSThR-TIRADS rather than intermediate risk by EU-TIRADS [20]. Moreover, a nodule with suspicious US features is regarded as high suspicion by KSThR-TIRADS only when it is solid and hypoechoic, whereas EU-TIRADS categorizes it as high risk regardless of its echogenicity or internal content. Thus, KSThR-TIRADS showed a relatively lower sensitivity. In terms of diagnostic value, compared with EU-TIRADS and ACR-TIRADS, KSThR-TIRADS yielded a higher AUC and specificity at the cost of decreased sensitivity, indicating that KSThR-TIRADS may be diagnostically superior to the other two models in clinical practice in our population.

TIRADS can aid decision making about the use of FNAB. Compared with other US systems, the ACR-TIRADS criteria offered the lowest rate of unnecessary FNAB [21], which could reduce the percentage of benign nodules that are biopsied [22]. Consistent with these results, we found that the malignancy rate among recommended FNABs was highest and the missed malignancy rate among non-FNABs was lowest with ACR-TIRADS compared with the other two systems. A recent study also found that ACR-TIRADS outperformed EU-TIRADS and KSThR-TIRADS in identifying nodules whose FNA can be safely deferred [23]. In that study, ACR-TIRADS classified over half of the biopsies as unnecessary, and the malignancy rate among these non-FNABs was 2.2%, lower than that of the other systems. This rate was considerably lower than in our study (33.1%), probably because of the difference in the percentage of carcinomas (7.2% vs 40.8%).

Interobserver agreement and reproducibility are also indispensable for evaluating a diagnostic model. A recent study demonstrated that EU-TIRADS showed higher reproducibility than ACR-TIRADS and KSThR-TIRADS, possibly owing to its smaller number of high-suspicion features and more gradual scoring. However, KSThR-TIRADS had the highest reproducibility for FNAB recommendations because the majority of low-suspicion and intermediate-suspicion nodules are submitted to FNAB [24]. The 2015 ATA classification was less reproducible than the 2011 Korean-TIRADS, owing to its higher overall complexity [25].

In addition to US, real-time elastography (RTE) and contrast-enhanced ultrasound (CEUS) have been applied to evaluate the tissue stiffness and blood perfusion of thyroid nodules. Both strain elastography and shear wave elastography (SWE) were useful in malignancy evaluation, with AUCs ranging from 0.89 to 0.93 for strain ratio [26,27,28,29] and from 0.91 to 0.94 for SWE [30,31,32]. Isthmic location, larger size, cystic component, operator inexperience, follicular carcinoma pathology and the inability to differentiate malignancy from thyroiditis influenced the efficiency of elastography [33,34,35,36,37]. As for CEUS, although no single qualitative or quantitative parameter showed sufficient advantage in the diagnosis of thyroid malignancy [38], it can depict nodule vasculature better than color Doppler ultrasound. The pooled sensitivity and specificity of CEUS for the differentiation of malignant and benign nodules reached 0.9 and 0.86, respectively [39]. Our previous study revealed that the combined use of RTE, CEUS, and US could improve the diagnostic efficiency for solid thyroid nodules [40].

Several limitations of our study should be addressed. Firstly, the ultrasound features were extracted from static images and US reports produced by radiologists, which might have introduced bias. Secondly, the percentage of carcinomas (40.8%) was high in the present study, which may be because the study enrolled nodules from tertiary referral hospitals. Such a high malignancy rate may result in a relatively higher PPV [41]. Thirdly, some nodules were regarded as benign lesions on the basis of cytology or US follow-up, which may have caused false-negative results.

In summary, all three newly updated TIRADS showed good performance in predicting thyroid malignancy and enabled more personalized and optimized management decisions for clinicians. This is the first study to evaluate the diagnostic value of the US risk stratification models from the KSThR, ETA and ACR, and our findings need to be further validated in future prospective studies and in clinical practice.