Introduction

About 70 % of adults have thyroid nodules on neck ultrasound [1, 2]. Most of these nodules are never discovered; however, the advent of imaging techniques has contributed to their identification [3, 4]. Nodule features detected through thyroid ultrasound help clinicians stratify patients according to their risk of cancer before fine needle aspiration biopsy is performed [5–8]. Given that 5–20 % of patients with thyroid nodules have thyroid malignancy, clinical guidelines recommend thyroid ultrasonography in the evaluation of thyroid nodules [5, 9, 10]. Suspicious nodules then undergo ultrasound-guided fine needle aspiration biopsy (USFNA) and cytological evaluation. In general, clinicians will continue to monitor thyroid nodules with benign cytology, while those with malignant cytology will proceed to surgery. Nodules with indeterminate cytology may receive diagnostic surgery, molecular testing, or ongoing monitoring depending on the preferences of clinicians and patients [10].

This widely accepted diagnostic strategy for thyroid nodules has two important limitations. First, estimates of the accuracy of each USFNA finding to detect malignancy were reported mostly in single-center studies and vary greatly across studies [11, 12]. Second, once obtained, cytological findings loom large and dominate the diagnostic process, all but ignoring the likelihood of thyroid cancer gleaned from the patient’s history and/or from the ultrasonographic findings [5, 6]. Improvements in the estimates of diagnostic accuracy of USFNA findings and integration with estimates of the probability of thyroid malignancy should improve the diagnostic process.

The goals of this study were to systematically appraise and summarize the available evidence about the diagnostic accuracy of USFNA findings, and to explore the integration of these estimates with the probability of thyroid malignancy before USFNA.

Methods

We performed a systematic review and meta-analysis regarding the diagnostic characteristics of USFNA for thyroid malignancy following a protocol and current guidelines on the conduct and reporting of systematic reviews of diagnostic accuracy [13].

Study selection

Type of studies

We searched for reports of randomized clinical trials, complete cohort studies, and case-control studies, published in English, that enrolled patients of any age and gender with thyroid nodules undergoing USFNA. We excluded studies involving patients with a history of thyroid cancer, studies reporting on only a single diagnostic category, and studies of fine needle aspiration without ultrasound guidance.

Type of intervention (index test)

The test of interest was USFNA with results reported using the Bethesda system, the British classification, or a 4-category scheme (benign, malignant, nondiagnostic, and indeterminate) [14, 15]. This test involves the identification of a thyroid nodule by ultrasound and ultrasound-guided insertion of a fine needle for cell aspiration and subsequent cytological interpretation.

Type of outcomes (reference test)

Our outcome of interest was the diagnostic accuracy of USFNA of thyroid nodules for thyroid malignancy. The reference standard was histopathological diagnosis from a surgical specimen. When histology was not available, i.e., when thyroid nodules with benign cytology were followed clinically rather than surgically removed, we considered long-term follow-up without the emergence of malignant features as an alternative reference standard. We described the accuracy of the USFNA findings using the likelihood ratio (LR) for each diagnostic category and for each reporting system (Bethesda, British, four categories). LRs quantify the ability of a test result to modify the pre-test probability: an LR of 1 does not change it; LRs of 0.2 or 5 produce moderate changes; and LRs of 0.1 or 10 produce large changes and are often seen with rule-out or rule-in test results, respectively [16, 17]. Formally, the LR is defined as the probability of a specific finding in patients with disease divided by the probability of the same finding in patients without disease [16, 17]. We preferred these measures over the more traditional sensitivity and specificity because they offer a more useful tool to integrate diagnostic accuracy measures into clinical decision-making [16–18].
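The definition of the LR given above can be illustrated with a minimal sketch. The function name and the counts are hypothetical and chosen only for illustration; they are not data from this review.

```python
def likelihood_ratio(finding_in_diseased, total_diseased,
                     finding_in_non_diseased, total_non_diseased):
    """LR = P(finding | malignancy) / P(finding | no malignancy)."""
    p_given_disease = finding_in_diseased / total_diseased
    p_given_no_disease = finding_in_non_diseased / total_non_diseased
    return p_given_disease / p_given_no_disease


# Hypothetical counts, for illustration only (not data from the review):
# 100 malignant nodules on histology, 5 of them read as benign on cytology;
# 400 benign nodules on histology, 300 of them read as benign on cytology.
lr_benign = likelihood_ratio(5, 100, 300, 400)
print(f"LR for a benign cytology result: {lr_benign:.3f}")
```

An LR below 1, as in this made-up example, lowers the probability of malignancy; an LR above 1 raises it.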

Data sources and searches

An experienced reference librarian (P.E.), working with the study’s lead investigator (N.S.O.), designed and conducted a comprehensive search strategy using controlled vocabulary and keywords for the concepts of diagnostic accuracy, USFNA, and thyroid cancer. The search covered Ovid MEDLINE In-Process and Other Non-Indexed Citations, Ovid MEDLINE, Ovid EMBASE, Ovid Cochrane Central Register of Controlled Trials, Ovid Cochrane Database of Systematic Reviews, and Scopus, from each database’s inception to August 2014 (Online Appendix 1). We also searched the reference lists of previous systematic reviews and consulted local thyroid disease experts for additional references that our database search strategy may have missed.

Selection of studies

Reviewers, working independently and in duplicate, reviewed all abstracts and titles for inclusion. After abstract screening and retrieval of potentially eligible studies, the full text publications were assessed for eligibility. Disagreements were resolved by consensus.

Data extraction

Using a standardized web-based form, reviewers collected the following information, independently and in duplicate from each study: country where study was conducted, number of patients and of thyroid nodules, average age, gender, and thyroid nodule size. Additionally, we extracted technical aspects of the index test such as operator experience (years performing USFNA), method of aspiration (negative pressure vs. capillary), number of USFNA passes as well as any measurement of inter-observer variability during the cytological examination of the specimens. Finally, we extracted true positive, true negative, false negative, and false positive values to construct a diagnostic contingency (or 2-by-2) table for each cytological outcome within each of the reporting systems evaluated.
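As an illustration of how a 2-by-2 table can be derived for one cytological category, the sketch below collapses a multi-category result into "this category" versus "all other categories". This convention, the function name, and the counts are assumptions made for illustration; the report does not describe the exact procedure used.

```python
def two_by_two(counts, category):
    """Collapse multi-category cytology counts into a 2-by-2 table.

    counts maps each cytology category to a pair
    (malignant on histology, benign on histology).
    The chosen category is treated as the "positive" finding.
    """
    tp, fp = counts[category]
    fn = sum(m for c, (m, _) in counts.items() if c != category)
    tn = sum(b for c, (_, b) in counts.items() if c != category)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical counts, for illustration only (not data from the review).
counts = {
    "benign": (5, 300),
    "indeterminate": (20, 60),
    "nondiagnostic": (5, 30),
    "malignant": (70, 10),
}
table = two_by_two(counts, "malignant")
print(table)
```

Repeating the call for each category yields one table per cytological outcome within a reporting system.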

Quality assessment

Reviewers, working independently and in duplicate, analyzed the eligible articles to assess their methods to protect them from bias using the Quality Assessment of Studies of Diagnostic Accuracy included in systematic reviews (QUADAS-2) tool [19, 20]. This tool assesses the risk of bias and applicability in terms of patient selection, index test, reference standard, flow and timing [19, 20].

Author contact

We attempted to contact 47 authors of included studies by email to verify and complete data that could not be discerned from the report; only 12 responded. We excluded studies for which required information to judge eligibility or to conduct key analyses remained unavailable after thorough review of the published report and author contact.

Subgroup and sensitivity analysis

To explain possible inconsistencies across study results we planned on conducting the following subgroup analyses for diagnostic accuracy of USFNA: method of USFNA, operator experience, needle gauge, number of aspirations, cytological examination, probe frequency, reporting system for the index test, reference standard use, risk of bias and age of population.

Data synthesis and statistical analysis

Meta-analysis

We used the random-effects model to pool likelihood ratios for the diagnosis of thyroid malignancy and their respective 95 % confidence intervals. This analysis was carried out in Meta-Disc [21] and Review Manager (RevMan) [22]. This model produces variance estimates that account for within-study variance (precision) and between-study differences in patients, methods, test performance, and reference standard performance (inconsistency). We used the I² statistic to assess inconsistency across individual studies, with I² > 50 % indicating large inconsistency [23]. As a sensitivity analysis, we conducted a bivariate meta-analysis, an approach that takes into account potential threshold effects and the correlation between sensitivity and specificity [24], as implemented in Statistical Analysis System (SAS) software.
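The pooling step can be sketched as follows. The text does not state which random-effects estimator was used; this sketch assumes the DerSimonian-Laird method applied to log likelihood ratios, a common choice, and is not a reproduction of the Meta-Disc or RevMan computations. The study counts and function names are hypothetical.

```python
import math


def log_lr_and_variance(tp, fp, fn, tn):
    """Log likelihood ratio of a finding and its approximate variance."""
    lr = (tp / (tp + fn)) / (fp / (fp + tn))
    variance = 1 / tp - 1 / (tp + fn) + 1 / fp - 1 / (fp + tn)
    return math.log(lr), variance


def pool_random_effects(effects, variances):
    """DerSimonian-Laird pooling; returns (pooled, se, tau2, I2 in %)."""
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, se, tau2, i2


# Hypothetical 2-by-2 counts (TP, FP, FN, TN) for three studies;
# illustration only, not data from the review.
studies = [(45, 5, 15, 135), (30, 2, 20, 148), (60, 12, 10, 118)]
logs, variances = zip(*(log_lr_and_variance(*s) for s in studies))
pooled, se, tau2, i2 = pool_random_effects(logs, variances)
pooled_lr = math.exp(pooled)
ci = (math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se))
print(f"Pooled LR {pooled_lr:.1f}, 95 % CI {ci[0]:.1f} to {ci[1]:.1f}, I2 = {i2:.0f} %")
```

Pooling is done on the log scale because the sampling distribution of a log ratio is closer to normal; the result is exponentiated back to the LR scale. Cells with zero counts would need a continuity correction, which this sketch omits.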

Results

Study identification

Our literature search identified 1085 abstracts, of which 32 studies were eligible (Fig. 1) [25–56]. These studies were mostly conducted in adults and included more than 79,541 thyroid nodules (number of nodules not reported in one study), with 15,641 undergoing surgical intervention, of which 27 % harbored malignancy (Table 1). Due to inconsistent reporting of long-term follow-up, the diagnostic accuracy analysis was based only on histopathology as a reference standard. Of the 32 included studies, 22 reported the USFNA outcomes in four categories [27–34, 36, 37, 39, 41, 42, 45–48, 51–53, 55, 56], eight used the Bethesda system [25, 26, 35, 43, 44, 49, 50, 54], and two used the British classification [34, 36].

Fig. 1 Study selection

Table 1 Description of the included studies

Risk of bias

Overall, the risk of bias was moderate to high due to the study of nonconsecutive samples, lack of blinding when retrospectively assessing the results of the index and reference tests, having few patients undergo evaluation with the histopathological reference standard, and having inconsistent follow-up and reporting of those followed clinically. These features increase the risk of overestimation of the diagnostic accuracy of the index test. A summary of the risk of bias evaluation of the included studies is depicted in Fig. 2, representing the proportion of studies deemed at high, moderate, or low risk of bias by the reviewers in each of the seven QUADAS-2 domains.

Fig. 2 Risk of bias using the QUADAS-2 instrument

Diagnostic accuracy by USFNA outcome and reporting system

The diagnostic accuracy of USFNA was analyzed according to USFNA outcome (all reporting systems included), the Bethesda reporting system, or a 4-category reporting system (Table 2). Overall, results in the categories of benign, malignant, and suspicious for malignancy were associated with LRs that significantly changed the pre-test probability. A benign USFNA result was associated with an LR of 0.09, 95 % CI 0.06–0.14, I² = 33 %, whereas a malignant finding on USFNA was associated with an LR of 197, 95 % CI 68–569, I² = 77 %. These estimates suggest that benign and malignant USFNA results all but rule out and rule in malignancy, respectively. Consistent results were found across reporting systems (Table 2). The distribution of histology results according to USFNA category and reporting system is shown in Table 3.

Table 2 Summary of the meta-analysis of diagnostic performance of USFNA for the diagnosis of thyroid malignancy
Table 3 Distribution of histology results by USFNA category and reporting system

Due to inconsistent reporting, we were unable to perform any of the planned subgroup analyses. As a sensitivity analysis, we evaluated only studies deemed to be at lower risk of bias (5 out of 7 QUADAS-2 criteria considered low risk). These estimates were compatible with the main analysis (Supplemental Table 1).

Discussion

Summary of the evidence

The body of evidence about the diagnostic accuracy of USFNA is at moderate risk of bias, which may lead to overestimation of diagnostic accuracy. The estimates are imprecise (wide confidence intervals) despite pooling and come from studies that report inconsistent results, an inconsistency that cannot be fully explained by the use of different reporting systems. With these caveats, USFNA results that are reported as benign, suspicious for malignancy, or malignant have a large impact on the probability of thyroid malignancy after the test; other results (nondiagnostic and indeterminate) have a smaller effect, making management decisions dependent on other clinical variables. These findings correlate with a study that evaluated the concordance between central and local histopathologists when interpreting USFNA results. As in our review, the benign and malignant categories were associated with the highest concordance between pathologists. Central pathologists reported fewer indeterminate diagnoses (found to be less helpful clinically) than local pathologists, highlighting the importance of experience in the interpretation of USFNA [57]. In addition, even when the USFNA biopsy has been carefully performed and interpreted, the test has an intrinsic limitation when trying to differentiate follicular adenomas from carcinomas [14].

Limitations

Our review is limited by the possibility that our search strategy may have missed studies, our language restriction, and that studies had to be excluded due to incomplete reporting of data. We have tried to mitigate these limitations by engaging an expert librarian in designing the search strategy [58] and by contacting study authors although this was accomplished with only partial success.

Additionally, pooling methods such as the random-effects model have been questioned in diagnostic systematic reviews. To address this potential limitation, we also conducted the analysis using a bivariate model. This method accounts for potential threshold effects and analyzes sensitivity and specificity jointly [24]. Furthermore, we were not able to conduct subgroup analyses of the accuracy of USFNA based on important clinical factors that might help explain the inconsistency in results across studies. Due to the moderate risk of bias of the included studies (given that only a minority of patients undergo histological evaluation), our results may overestimate the diagnostic accuracy of USFNA. Also, our findings do not apply to the less common thyroid malignancies, which were poorly represented in the included studies.

Applications for practice

To better understand these results, it is best to apply these findings to clinical cases. Consider a nodule labeled as highly suspicious for thyroid cancer based on US features, that is, for every 10 such nodules we would expect to find malignancy in eight of them [10]. If the USFNA result is benign, that number drops to about 3 in 10. One would hardly stop evaluation and treatment at such a high likelihood of malignancy. But one would stop for a nodule characterized on US features as having a low suspicion for malignancy, that is, for every 10 such nodules we would expect to find malignancy in at most one of them [10]. In that case, a benign USFNA result all but excludes (<1 %) the possibility of malignancy. So the same test result, a benign USFNA, has different implications depending on the probability of malignancy before the USFNA is performed, given the characteristics of the patient and of the nodule on US.
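The arithmetic behind these two cases can be sketched with the odds form of Bayes' theorem: post-test odds equal pre-test odds multiplied by the LR. The LR of 0.09 for a benign result is the pooled estimate reported in this review; the pre-test probabilities of 80 % and 10 % are the illustrative values used in the text, and the function name is ours.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


LR_BENIGN = 0.09  # pooled LR for a benign USFNA result reported in this review

high_suspicion = post_test_probability(0.80, LR_BENIGN)
low_suspicion = post_test_probability(0.10, LR_BENIGN)
print(f"High suspicion, benign USFNA: {high_suspicion:.1%}")
print(f"Low suspicion, benign USFNA:  {low_suspicion:.1%}")
```

The same LR moves a high pre-test probability to roughly one in four, but moves a low pre-test probability below 1 %, which is why the pre-test estimate matters for interpreting the cytology result.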

Another example would be if a highly suspicious nodule is found to have a biopsy suspicious for follicular neoplasm, where the risk of thyroid cancer will be 71 %. In this case, patients and clinicians might feel comfortable with a recommendation for diagnostic thyroidectomy. In contrast, a nodule with intermediate suspicion for malignancy based on US features and the same USFNA result (suspicious for follicular neoplasm) will have a 10 % risk of malignancy. In this case, further diagnostic testing (e.g., molecular testing) might be helpful before diagnostic thyroidectomy is recommended.

Clinicians make intuitive and often implicit judgments about the likelihood of malignancy as they receive reports from US and USFNA. These judgments may be open to error that an explicit approach may reduce. Similarly, an explicit approach may be helpful in discussing cases with trainees who have yet to form the kinds of intuitions that guide the judgments of more senior colleagues. Several approaches are available to bring this so-called Bayesian reasoning to the clinic [18]. To facilitate a more explicit use of the estimates reported here, we have developed a calculator to determine the risk of malignancy for thyroid nodules after a USFNA (freely available at http://www.thyroidcarisk.mayo.edu/). In this tool, the user enters a best guess of the probability of thyroid malignancy based on clinical judgment (informed by prevalence, clinical and ultrasound features, and their experience) or, alternatively, uses the ATA ultrasonography risk stratification system for thyroid nodules [10]. Then the user selects a USFNA finding (e.g., malignant, benign). The resulting post-test probability is represented in a 100-person pictogram [59]. This tool brings visual clarity to the process, allows for what-if scenarios, and offers the opportunity to involve patients in the decision-making process [60].
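To make the idea of a 100-person pictogram and what-if scenarios concrete, here is a minimal text-based sketch. It is not the implementation of the calculator described above; the function names, the symbols, and the layout are assumptions. Only the LR of 0.09 for a benign result comes from this review.

```python
def affected_out_of_100(pre_test_probability, likelihood_ratio):
    """Post-test probability expressed as a whole number of people in 100."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return round(100 * post_odds / (1 + post_odds))


def pictogram(affected, total=100, per_row=10):
    """Text grid: '#' marks a person with malignancy, '.' a person without."""
    symbols = "#" * affected + "." * (total - affected)
    rows = [symbols[i:i + per_row] for i in range(0, total, per_row)]
    return "\n".join(rows)


# What-if scenarios: the same benign result at different pre-test probabilities
for pre in (0.10, 0.50, 0.80):
    n = affected_out_of_100(pre, 0.09)
    print(f"Pre-test {pre:.0%}: {n} in 100 after a benign USFNA")

print(pictogram(affected_out_of_100(0.80, 0.09)))
```

Expressing the result as a count out of 100 people rather than a percentage is the design choice that the pictogram format relies on.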

Applications for research

Our study identifies large knowledge gaps in the diagnostic process of a common clinical problem, the management of thyroid nodules. To the subjective estimates of a probability of malignancy from clinical and ultrasound features, we add imprecise and perhaps biased estimates of the LR for USFNA results based on the best available evidence. At the extremes of these ranges, the resulting probability of malignancy after USFNA may suggest different ‘next steps’ emphasizing the need for additional research to improve the validity and precision of our diagnostic estimates. The Institute of Medicine has recently called for a greater investment in diagnostic research to improve its contribution to the value and safety of clinical care [61].

Opportunities for research are rich. For example, tools that integrate clinical and ultrasound features to determine a more precise pre-test probability are required. Moreover, given that an important limitation of the included studies is that their estimates are based only on the patients who underwent surgery [62], diagnostic studies at low risk of bias that provide consistent and complete follow-up for patients should be conducted. Finally, this framework provides clinicians with the opportunity to involve patients in the decision-making process, especially when deciding the next step of management based on action thresholds that take into consideration patients' values and preferences. However, formal studies assessing the impact of this tool on patient-important outcomes and on engaging patients in shared decision-making are needed.

Conclusion

The available evidence regarding the diagnostic accuracy of USFNA warrants only limited confidence due to risk of bias, imprecision, and inconsistency. However, some USFNA results (benign, malignant) are likely very helpful, by significantly changing the pre-test probability of thyroid cancer.