Introduction

Ultrasonography (US) is the diagnostic modality of choice for the characterization of thyroid nodules [1]. To date, several international societies have developed US-based risk stratification systems, also known as Thyroid Imaging Reporting and Data Systems (TIRADS), to maximize the diagnostic performance of thyroid US and identify those thyroid nodules that should be biopsied [2,3,4,5]. In 2015, the American Thyroid Association (ATA) proposed a qualitative US-based five-tier risk stratification system [3]. The Korean Thyroid Association/Korean Society of Thyroid Radiology (KTA/KSThR) also proposed a risk stratification system (K-TIRADS), which is a pattern-based qualitative system defining four categories with different risks of malignancy [4]. In 2017, the American College of Radiology (ACR) proposed a five-tier risk stratification system (ACR-TIRADS) that was characterized by its quantitative scoring method [2]. In the same year, the European Thyroid Association also proposed a pattern-based qualitative system defining four categories (EU-TIRADS) [5].

Although fine-needle aspiration biopsy (FNAB) has a crucial role in the diagnosis of thyroid cancer, there has been an emphasis on reducing the number of excessive biopsies, which can lead to overdiagnosis and overtreatment, especially considering the relatively indolent nature of thyroid cancer [6,7,8,9,10]. In this regard, the evaluation of the current TIRADS has shifted from assessing diagnostic performance alone to also considering unnecessary biopsy rates. However, there is considerable discordance between the TIRADS in the recommended criteria for suspicious US patterns and size cut-offs for FNAB [11, 12]. Although many authors have attempted to evaluate and compare the unnecessary biopsy rates and diagnostic performance of each system [11,12,13,14,15], substantial between-study heterogeneity remains, which makes interpretation difficult. Therefore, we considered it timely and necessary to summarize the currently available data to provide valuable information for clinical practice and future revisions of the current TIRADS.

Thus, the present systematic review and meta-analysis aimed to evaluate the diagnostic performance and unnecessary thyroid nodule biopsy rates under four representative US-based risk stratification systems: ACR-TIRADS, ATA, EU-TIRADS, and K-TIRADS.

Materials and methods

This study was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [16].

Search strategy and eligibility criteria

A literature search of the MEDLINE/PubMed and EMBASE databases was conducted using pertinent MeSH or EMTREE terms with common keywords for relevant articles up until August 5, 2019. The search terms were as follows: ((thyroid)) AND ((thyroid imaging reporting and data system) OR (TIRADS) OR (TI-RADS) OR (guideline)) AND ((American Thyroid Association) OR (ATA) OR (American College of Radiology) OR (ACR) OR (Europe*) OR (EU-TIRADS) OR (Korea*) OR (K-TIRADS)). The search was limited to English-language publications, but was not limited by human or animal studies, or publication date.

After eliminating duplicates, articles were screened according to their title and abstract. Full-text articles were then thoroughly assessed according to the following eligibility criteria: (a) population: patients who underwent US examinations for thyroid nodules; (b) index test: US-based risk stratification systems according to at least one of the following guidelines: ACR-TIRADS [2], ATA [3], EU-TIRADS [5], and K-TIRADS [4]; (c) reference standard: pathological diagnosis or imaging follow-up; (d) outcomes: unnecessary biopsy rate; (e) study design: not limited. Studies were excluded if any of the following criteria were met: (a) studies including non-consecutive nodules; (b) studies not providing sufficient details to calculate the unnecessary biopsy rate; (c) review articles; (d) case reports or case series including fewer than ten patients; (e) conference abstracts; (f) letters, editorials, and comments; (g) animal studies; (h) studies with a partially overlapping patient cohort (for studies with an overlapping study population, the study with the largest population was selected); (i) studies conducted with a pediatric population; or (j) studies using a pathology reporting system other than the Bethesda classification system [17]. The literature search and application of the criteria were conducted independently by two authors (P.H.K. and C.H.S., with 3 and 8 years of experience in performing thyroid US and interventional procedures, respectively), and any discrepancies were resolved through discussion and consensus with a third author (J.H.B., with 21 years of experience in performing thyroid US and interventional procedures).

Data extraction and quality assessment

A standardized extraction form was used to obtain the following information from the selected studies: (a) study characteristics: institution, study period, study design (prospective vs. retrospective; single-center vs. multicenter), reference standard, and blinding to the reference standard; (b) demographic and clinical characteristics: total number of patients, total number of nodules and malignant nodules, mean age (range), and proportion of female patients; (c) unnecessary biopsy rates; and (d) diagnostic performance of each risk stratification system in the form of a 2 × 2 table, with indication for FNA as the index test [1]. The quality of the selected studies was investigated using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) [18].

Data synthesis and analysis

The primary outcome of this meta-analysis was the unnecessary biopsy rate, defined as the proportion of benign nodules among the biopsied nodules. Meta-analytic pooling was based on the inverse variance method for calculating weights, and the pooled estimates and their 95% confidence intervals (CIs) were determined using DerSimonian–Laird random-effects modeling. Heterogeneity across studies was assessed using the Q test and the I² statistic, with I² > 50% being taken to indicate the presence of substantial heterogeneity [19,20,21].
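The pooling step described above can be illustrated with a minimal Python sketch. It is not the authors' code (the analyses were run in Stata and R); it assumes a logit transformation of each study's proportion, which the text does not specify, and the per-study counts are hypothetical placeholders.

```python
import math

def pool_proportions_dl(events, totals):
    """Pool proportions on the logit scale with DerSimonian-Laird random effects."""
    # Logit-transformed proportions and their approximate within-study variances.
    y = [math.log(e / (n - e)) for e, n in zip(events, totals)]
    v = [1.0 / e + 1.0 / (n - e) for e, n in zip(events, totals)]

    # Fixed-effect (inverse-variance) weights and Cochran's Q.
    w = [1.0 / vi for vi in v]
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1

    # DerSimonian-Laird estimate of the between-study variance tau^2.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # I^2: percentage of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects weights, pooled logit, and its 95% CI.
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    def inv_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return {
        "pooled": inv_logit(mu),
        "ci_low": inv_logit(mu - 1.96 * se),
        "ci_high": inv_logit(mu + 1.96 * se),
        "tau2": tau2,
        "i2_percent": i2,
    }

# Hypothetical counts: benign nodules among biopsied nodules in three studies.
benign_biopsied = [120, 45, 300]
all_biopsied = [400, 200, 900]
result = pool_proportions_dl(benign_biopsied, all_biopsied)
print(result)
```

The sketch requires every study to have at least one benign and one malignant biopsied nodule; a continuity correction would be needed otherwise.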

The secondary outcome was the diagnostic odds ratio (DOR) of each system with indication for FNA as the index test. For the meta-analytic pooling of the DOR, a two-by-two table was constructed for each study, comprising true-positive (TP; nodules for which FNAB was indicated and the nodule was found to be malignant), false-positive (FP; nodules for which FNAB was indicated and the nodule was found to be benign), false-negative (FN; nodules for which FNAB was not indicated yet the nodule was found to be malignant), and true-negative (TN; nodules for which FNAB was not indicated and the nodule was found to be benign) findings, and a bivariate random-effects model was fitted. In addition, the pooled sensitivity and specificity and their 95% CIs were calculated, and a coupled forest plot was constructed [20,21,22,23,24]. Indirect comparisons of unnecessary biopsy rates and DORs between the risk stratification systems were performed using a Wald-type chi-square test with multiplicity adjustment, and the regression coefficient was obtained to estimate the intervention effect from a reference group [25, 26]. Statistical analyses were conducted by one of the authors (C.H.S., with 8 years of experience in performing systematic reviews and meta-analyses) using the “metandi” and “midas” modules in Stata 15.0 (StataCorp), and the “meta”, “metafor”, and “mada” packages in R software (version 3.6.2; R Foundation for Statistical Computing).
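As an illustration of the per-study quantities defined above, the following Python sketch derives sensitivity, specificity, the unnecessary biopsy rate, and the DOR from a single two-by-two table. The counts are hypothetical, and the confidence interval uses the standard log-odds-ratio (Woolf) approximation; the bivariate random-effects pooling performed in the study is not reproduced here.

```python
import math

def diagnostic_summary(tp, fp, fn, tn):
    """Summarize one study's 2 x 2 table (FNAB indicated vs. final diagnosis)."""
    sensitivity = tp / (tp + fn)          # malignant nodules selected for FNAB
    specificity = tn / (tn + fp)          # benign nodules spared FNAB
    unnecessary_rate = fp / (tp + fp)     # benign nodules among biopsied nodules
    dor = (tp * tn) / (fp * fn)           # diagnostic odds ratio

    # Woolf approximation: standard error of log(DOR) and 95% CI.
    se_log_dor = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    ci_low = math.exp(math.log(dor) - 1.96 * se_log_dor)
    ci_high = math.exp(math.log(dor) + 1.96 * se_log_dor)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "unnecessary_biopsy_rate": unnecessary_rate,
        "dor": dor,
        "dor_ci": (ci_low, ci_high),
    }

# Hypothetical counts for a single study (not taken from any included article).
summary = diagnostic_summary(tp=90, fp=60, fn=10, tn=140)
print(summary)
```

All four cells must be non-zero for the DOR and its standard error to be defined; studies with empty cells would need a continuity correction.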

Results

Literature search

A flow chart summarizing the publication selection process is presented in Fig. 1. A total of 411 non-duplicate studies were identified. Of these, 307 articles were excluded on the basis of their titles and abstracts because they were not in the field of interest (n = 232), or they were guidelines (n = 63), reviews (n = 8), case reports (n = 2), an erratum (n = 1), or an animal study (n = 1). Subsequently, 104 potentially eligible full-text articles were assessed according to the eligibility criteria, and a further 96 studies were excluded because they included non-consecutive nodules (n = 29), did not provide sufficient details to calculate the unnecessary biopsy rate (n = 29), did not use any of the four risk stratification systems of interest (ACR-TIRADS, ATA, EU-TIRADS, or K-TIRADS; n = 11), used data included in subsequent articles (n = 10), were not in the field of interest (n = 9), included inseparable adult and pediatric patients (n = 6), used a histopathologic reporting system other than the Bethesda system (n = 1), or did not include histopathology as a reference standard (n = 1). Consequently, a total of eight articles including 13,092 thyroid nodules met the eligibility criteria and were included in the analysis [11, 13,14,15, 27,28,29,30].
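The selection counts reported above can be cross-checked with a few lines of Python. The numbers are transcribed from the preceding paragraph; the variable names are illustrative only.

```python
# Counts transcribed from the literature-search paragraph.
identified = 411

excluded_on_title_abstract = {
    "not in the field of interest": 232,
    "guidelines": 63,
    "reviews": 8,
    "case reports": 2,
    "erratum": 1,
    "animal study": 1,
}

excluded_on_full_text = {
    "non-consecutive nodules": 29,
    "insufficient detail for unnecessary biopsy rate": 29,
    "none of the four systems used": 11,
    "data included in subsequent articles": 10,
    "not in the field of interest": 9,
    "inseparable adult and pediatric patients": 6,
    "non-Bethesda reporting system": 1,
    "no histopathologic reference standard": 1,
}

screened_out = sum(excluded_on_title_abstract.values())
full_text_assessed = identified - screened_out
full_text_out = sum(excluded_on_full_text.values())
included = full_text_assessed - full_text_out

print(screened_out, full_text_assessed, full_text_out, included)
# prints: 307 104 96 8
```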

Fig. 1 Flow chart of the publication selection process

Characteristics of the included studies

The detailed study characteristics are summarized in Table 1. One of the eight studies had a prospective design [28], and three were multicenter studies [14, 15, 27, 31,32,33]. The number of included patients ranged from 127 to 3190, and the mean patient age ranged from 44 to 55 years. The proportion of female patients in each study ranged from 61.2 to 86.6%, and the proportion of female patients in the pooled population was 77.7% (8280 out of 10,654; excluding Wu et al [30], for which the data were not available). The proportion of malignant nodules in each study varied from 13.2 to 53.0%, with the pooled proportion being 29.2% (3826 out of 13,092). Unnecessary biopsy rates according to ACR-TIRADS, ATA, EU-TIRADS, and K-TIRADS were reported in eight [11, 13,14,15, 27,28,29,30], five [13, 14, 27, 29, 30], two [11, 15], and five [11, 13,14,15, 27] studies, respectively.

Table 1 Characteristics of the included studies

Quality assessment

The results of the quality assessment based on QUADAS-2 criteria are shown in Supplementary Figure S1. Three studies [11, 14, 28] had an unclear risk of bias in the index test domain because of no or unclear blinding to the reference standard during the US examinations. All eight studies [11, 13,14,15, 27,28,29,30] had an unclear risk of bias in the reference standard domain because of no or unclear blinding to the index test during pathologic evaluation. Additionally, three studies [11, 14, 27] had a high risk and one study [28] an unclear risk of bias in the flow and timing domain because the reference standard for diagnosing benign nodules was inconsistent, or not clearly consistent, across the study population. Three studies [11, 15, 28] raised high concern regarding the applicability of the index test because the US images were read by a single reader or the number of readers was not reported. One study [28] raised unclear concern regarding the applicability of the reference standard because it gave no information on how the tissue specimens were examined. There were no concerns regarding the applicability of patient selection.

Unnecessary biopsy rates

The pooled unnecessary biopsy rates of ACR-TIRADS, ATA, EU-TIRADS, and K-TIRADS were 25% (95% CI, 22–29%), 51% (95% CI, 44–58%), 38% (95% CI, 16–66%), and 55% (95% CI, 42–67%), respectively (Fig. 2). Substantial heterogeneity was observed with all four risk stratification systems (I² > 50%). Meta-regression analysis showed that the pooled unnecessary biopsy rate of ACR-TIRADS was significantly lower than that of ATA (OR [95% CI], 1.29 [1.15–1.44]; p < .001) and K-TIRADS (OR [95% CI], 1.34 [1.20–1.49]; p < .001; Table 2). It was also lower than that of EU-TIRADS, although this difference did not reach statistical significance (p = .087).

Fig. 2 Unnecessary biopsy rates of the four risk stratification systems

Table 2 Results of the meta-regression for unnecessary biopsy rates

Diagnostic performance

The pooled DORs of each system for selecting thyroid nodules for FNA are depicted in Fig. 3. Meta-analytic pooling was not possible for EU-TIRADS as data were available for only two studies [11, 15]. The pooled DORs of ACR-TIRADS, ATA, and K-TIRADS were 5.9 (95% CI, 3.6–9.6), 6.3 (95% CI, 4.5–8.8), and 4.5 (95% CI, 1.7–11.6), respectively. Substantial heterogeneity was observed with all three risk stratification systems (I² > 50%). Indirect comparisons showed that the DOR of ACR-TIRADS was not statistically different from that of ATA (p = .816) or K-TIRADS (p = .524). A sensitivity analysis excluding the study by Xu T et al [15], because of its relatively low DOR, showed a modest decrease in heterogeneity for ACR-TIRADS (I², 95% to 76%) and a marked decrease for K-TIRADS (I², 97% to 0%); the pooled DORs of ACR-TIRADS, ATA, and K-TIRADS in this analysis were 7.0 (95% CI, 5.3–9.2), 6.3 (95% CI, 4.5–8.8), and 6.3 (95% CI, 5.0–7.9), respectively. Indirect comparisons again showed that the DOR of ACR-TIRADS was not statistically different from that of ATA (p = .605) or K-TIRADS (p = .658). The pooled sensitivities of ACR-TIRADS, ATA, and K-TIRADS were 75% (95% CI, 61–84%), 93% (95% CI, 88–95%), and 91% (95% CI, 80–96%), respectively, while the pooled specificities were 67% (95% CI, 61–73%), 34% (95% CI, 26–42%), and 32% (95% CI, 25–39%), respectively. Of note, ACR-TIRADS showed significantly lower sensitivity than ATA (p < .01) and K-TIRADS (p < .01), but significantly higher specificity than ATA (p < .01) and K-TIRADS (p < .01) (Supplementary Table S1).

Fig. 3 Diagnostic odds ratios of the three risk stratification systems

Discussion

The present meta-analysis investigated the unnecessary biopsy rates of each thyroid nodule risk stratification system using eight studies including 13,092 thyroid nodules. The unnecessary biopsy rate was lower with ACR-TIRADS (25%) than with ATA (51%), EU-TIRADS (38%), or K-TIRADS (55%), with this finding being confirmed in the meta-regression analysis. The DOR was comparable between the risk stratification systems. Considering our results and the clinical importance of the unnecessary biopsy rate in the workup of thyroid nodules, future revisions of each system to reduce unnecessary biopsy rates should be made by referring to ACR-TIRADS.

In our meta-analysis, ACR-TIRADS showed the lowest unnecessary biopsy rate among the four risk stratification systems, which is concordant with previous studies [12, 14, 32]. This low rate can be explained by differences between the systems in the minimum nodule size at which FNAB is recommended and in the assumed risk of malignancy for each category. Indeed, a simulation study conducted by Ha SM et al demonstrated that the unnecessary biopsy rates of ATA and K-TIRADS became similar to that of ACR-TIRADS (21%) when the ACR-TIRADS nodule size cut-offs were applied to each category (ATA, 55% to 20%; K-TIRADS, 60% to 26%) [13]. This indicates that unnecessary biopsy rates may be largely determined by the nodule size cut-off for FNAB. In detail, the risks of malignancy and size cut-offs for FNAB in nodules with intermediate suspicion are 5–20% and 15 mm for ACR-TIRADS, 10–20% and 10 mm for ATA, 6–17% and 15 mm for EU-TIRADS, and 15–50% and 10 mm for K-TIRADS [2,3,4,5, 12]. These data show that ACR-TIRADS, ATA, and EU-TIRADS assume similar risks of malignancy, but that ATA sets a smaller size cut-off for FNAB. K-TIRADS assumes a wide range in the risk of malignancy (15–50%) and a 10-mm size cut-off for FNAB. For low-suspicion nodules, the risks of malignancy and size cut-offs for FNAB are 5% and 25 mm for ACR-TIRADS, 5–10% and 15 mm for ATA, 2–4% and 20 mm for EU-TIRADS, and 3–15% and 15 mm for K-TIRADS, showing that the four systems assume a similar risk of malignancy, but that ACR-TIRADS has the largest size cut-off for FNAB. Furthermore, Yim Y et al reported a high concordance between ACR-TIRADS, ATA, and K-TIRADS for high- or intermediate-suspicion nodules, indicating that the size cut-off for FNAB is the main factor influencing diagnostic performance [31]. Therefore, an understanding of the impact of the size cut-offs for each category seems necessary for future revisions of TIRADS.

Our analysis showed that ACR-TIRADS had a DOR comparable to those of ATA and K-TIRADS, but lower sensitivity and higher specificity. These differences have also been reported in previous studies [12, 32]. They can be at least partially explained by the nodule size cut-off for FNAB, as elucidated by the simulation study by Ha SM et al [13]. In their study, when nodule size cut-offs similar to those used in ACR-TIRADS were applied to each category, the sensitivity of ATA and K-TIRADS decreased, but the specificity and accuracy increased (ATA: sensitivity, 92% to 61%; specificity, 34% to 76%; accuracy, 44% to 73%; K-TIRADS: sensitivity, 94% to 64%; specificity, 29% to 69%; accuracy, 39% to 68%).
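The way a lower sensitivity and a higher specificity can offset each other in the DOR can be illustrated by converting the pooled sensitivities and specificities reported in the Results into approximate odds ratios. This Python sketch is only a back-of-the-envelope check: a DOR computed from pooled summary points need not equal the DOR pooled across studies, so the values differ somewhat from the reported 5.9, 6.3, and 4.5.

```python
def dor_from_sens_spec(sensitivity, specificity):
    """DOR = odds of a positive test in disease / odds of a positive test without disease."""
    positive_odds_malignant = sensitivity / (1 - sensitivity)
    positive_odds_benign = (1 - specificity) / specificity
    return positive_odds_malignant / positive_odds_benign

# Pooled sensitivity and specificity as reported in the Results section.
pooled = {
    "ACR-TIRADS": (0.75, 0.67),
    "ATA": (0.93, 0.34),
    "K-TIRADS": (0.91, 0.32),
}

approx_dor = {name: dor_from_sens_spec(se, sp) for name, (se, sp) in pooled.items()}
for name, value in approx_dor.items():
    print(f"{name}: approximate DOR {value:.1f}")
```

Under this approximation the three systems yield values of roughly 6.1, 6.8, and 4.8, which is consistent with the finding that their DORs are comparable despite very different operating points.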

Recently, many efforts have been made to improve the risk stratification systems for thyroid nodules [11,12,13]. In current practice, the mortality rates of thyroid cancer have not changed, although there has been an increasing incidence of thyroid cancer [9, 10], implying a tendency to overdiagnosis. Therefore, an optimal risk stratification system requires both low rates of unnecessary biopsies and high discriminatory power to select nodules requiring FNAB, thereby reducing patients’ discomfort and anxiety, and reducing medical costs associated with excessive biopsies. Thus, we evaluated the current risk stratification systems in terms of unnecessary biopsy rates and DOR to measure the discriminatory power of the diagnostic tests. As the DOR is independent of the frequency of events in the study population (e.g., the proportion of malignant nodules in each study) [33, 34], it can minimize associated bias. Furthermore, DOR is a single indicator that makes comparisons between diagnostic tests simple. Indeed, the conventional indicators that have been used to evaluate TIRADS (e.g., sensitivity and specificity) explain only a part of the diagnostic performance and are thus not decisive by themselves, making it difficult to simply rank different TIRADS. Therefore, the use of DOR seems appropriate in our study, and it may also be useful in future research. Considering our results, future revisions should take reducing overdiagnosis into account, thus minimizing unnecessary biopsies by referring to ACR-TIRADS.
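The statement that the DOR does not depend on the proportion of malignant nodules, whereas the unnecessary biopsy rate does, can be illustrated with a short Python sketch. The sensitivity and specificity are the pooled ACR-TIRADS values from the Results; the cohort size and the three malignancy proportions are hypothetical.

```python
def expected_table(sensitivity, specificity, prevalence, n_nodules):
    """Expected 2 x 2 cell counts for a test with fixed sensitivity and specificity."""
    malignant = prevalence * n_nodules
    benign = n_nodules - malignant
    tp = sensitivity * malignant
    fn = malignant - tp
    tn = specificity * benign
    fp = benign - tn
    return tp, fp, fn, tn

def dor(tp, fp, fn, tn):
    return (tp * tn) / (fp * fn)

def unnecessary_biopsy_rate(tp, fp, fn, tn):
    # Proportion of benign nodules among nodules selected for biopsy.
    return fp / (tp + fp)

rows = []
for prevalence in (0.10, 0.30, 0.50):  # hypothetical malignancy proportions
    table = expected_table(0.75, 0.67, prevalence, n_nodules=1000)
    rows.append((prevalence, dor(*table), unnecessary_biopsy_rate(*table)))

for prevalence, d, ubr in rows:
    print(f"prevalence {prevalence:.0%}: DOR {d:.2f}, unnecessary biopsy rate {ubr:.1%}")
```

In this sketch the DOR stays the same at every prevalence, while the unnecessary biopsy rate falls as the proportion of malignant nodules rises, which is one reason both measures were examined together.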

However, it should also be emphasized that simply reducing unnecessary biopsy rates is not always the right answer, because doing so may increase the risk of missed malignancy. Indeed, ACR-TIRADS demonstrated the lowest sensitivity (75%) among the risk stratification systems. Admittedly, the probability of malignancy among the examined nodules is low, and one retrospective study reported that only 1.2% (17/1382) of nodules for which FNAB was not required according to ACR-TIRADS were confirmed as malignant [35]. However, to our knowledge, there is no large prospective study evaluating whether reducing unnecessary biopsy rates is indeed beneficial in terms of cost-effectiveness without a negative impact on survival. Further studies are needed to clarify this issue.

Our study has several limitations. First, all studies except one were retrospective, implying potential misclassification due to unstandardized image acquisition during the examination. Second, the included studies used heterogeneous minimum nodule size cut-offs for inclusion, and therefore a study-level meta-analysis of nodules larger than 1 cm was not possible. In addition, national or institutional biopsy policies might act as a confounder. Third, the included studies were performed in tertiary referral hospitals, and therefore the data presented in this study might not reflect the actual primary care setting. Fourth, the influence of interobserver variability and clinical expertise could not be evaluated. Finally, there was substantial heterogeneity in both the pooled unnecessary biopsy rates and the pooled DORs. To address this, we performed meta-regression and sensitivity analyses, but the heterogeneity was not fully resolved. This might be due to inconsistent minimum nodule size cut-offs for inclusion and heterogeneous classification of nodules between the studies. In particular, follicular neoplasms were regarded as indeterminate cytology and excluded from the analysis in the study by Wu et al [30] but were included and classified based on their surgical pathology in some studies [13, 14, 27]. This unresolved heterogeneity might affect the credibility of the results.

In conclusion, ACR-TIRADS showed a lower unnecessary biopsy rate than the other risk stratification systems, although the DOR was comparable between ACR-TIRADS, ATA, and K-TIRADS. Future revisions of each system should refer to ACR-TIRADS to reduce unnecessary biopsy rates.