Introduction

Although its etiology remains unclear [1], thyroid carcinoma accounts for 95% of endocrine malignancies and is the sixth most common cancer in women in the USA, where more than 52,000 new cases in men and women were estimated for 2019 [2]. Differentiated thyroid cancer (DTC), which includes papillary thyroid carcinoma (PTC) and follicular thyroid cancer, is the most frequent subtype of thyroid cancer. Like other primary malignancies of the head and neck, PTC shows a consistent pattern of metastasis to regional lymph nodes (LNs), which occurs in 20 to 50% of cases [3]. Although most patients with DTC have favorable long-term survival, 5–20% of patients develop recurrence in LNs during postoperative follow-up, and a considerable number of patients already present with LN metastasis (LNM) at initial diagnosis [4].

A LN burden (ratio of positive LNs to total removed LNs) > 17% in the lateral neck is predictive of recurrence in patients of all ages, and patients with ≥ 2 metastatic LNs might benefit from total thyroidectomy and radioactive iodine therapy [5]. The 2015 guidelines of the American Thyroid Association for the diagnosis and management of DTC recommend that the central and lateral cervical LNs be examined by routine ultrasonography (US) before surgery and that LNs suspicious for DTC metastases on US undergo US-guided fine needle aspiration cytology (FNAC) [6]. However, several factors may decrease the diagnostic efficacy of FNAC, such as small LN size, lymphocyte infiltration, necrosis, or a lack of epithelial component in cyst aspirates [7]. In such cases, measurement of thyroglobulin (Tg) levels in the fine needle aspiration (FNA) material (FNA-Tg) can provide additional diagnostic clues.

The reported diagnostic accuracy of FNA-Tg varies because of relatively small sample sizes, different threshold values, and distinct clinical settings (before or after thyroidectomy) [8, 9]. In addition, those studies used different detection methods for FNA-Tg measurement, such as immunoradiometric assay (IRMA), immunochemiluminometric assay (ICMA), electrochemiluminescence immunoassay (ECLIA), immunometric assay (IMA), and immunofluorometric assay (IFMA), which might affect the accuracy of the analysis because the assays differ in analytical sensitivity (Se), functional Se, and inter-assay variability [10, 11]. Therefore, the diagnostic performance of FNA-Tg, alone and in combination with FNAC, in identifying LNM needs to be determined under different conditions. This prompted the present meta-analysis of diagnostic test accuracy.

In the era of evidence-based medicine, decision-makers need high-quality data to support the decision to use a diagnostic test in a specific clinical situation and to choose the most relevant one. Meta-analysis of diagnostic test accuracy studies is a useful method to increase the level of validity by combining data from multiple studies. Statistical methods dedicated to meta-analyses of diagnostic test accuracy provide either summary points of diagnostic accuracy (e.g., summary sensitivity (Se), specificity (Sp), positive and negative likelihood ratios, and diagnostic odds ratio (DOR)) or summary lines (i.e., summary receiver operating characteristic [SROC] curves). The DOR has the advantage of expressing accuracy as a single indicator that is closely linked to both Se and Sp and reflects the strength of the association between test result and disease [12]. The current study had two aims: (1) to compare the diagnostic efficacy of FNAC, FNA-Tg, and their combination at preoperative and postoperative stages; (2) to explore the appropriate cut-off of FNA-Tg for diagnosing LNM of DTC.

Methods

Search strategy

Following the recommendations of the Cochrane handbook for systematic reviews of diagnostic test accuracy (handbook.cochrane.org) [13], two independent investigators predefined the retrieval strategy and conducted a systematic search of bibliographic databases (PubMed, Web of Science, the Cochrane Library, Google Scholar, and CNKI) for articles published up to October 2019, using different combinations of the search terms “FNAC”, “FNA-Tg/Tg”, “lymph node/lymph node metastases”, and “thyroid cancer/thyroid carcinoma”. The references cited in each identified article were also searched manually for potentially eligible studies. We removed duplicated and unrelated articles by reading titles and abstracts, and then excluded articles not meeting the inclusion criteria by reviewing the full text. We did not set any language restrictions on our search; any disagreements during article searching and data extraction were resolved through team discussion. We contacted the authors for specific raw data if the data provided in an article were not sufficient. To avoid including duplicated samples, authors of papers that appeared to be relevant were contacted via email if an address was available. If overlapping data from the same first author were found, the article with the largest number of subjects was included.

Inclusion criteria

All selected studies had to meet the following criteria: (1) malignant LNs were confirmed through pathological examination after surgical resection (LNs that were not resected could be considered negative if they remained negative after 2 years of follow-up); (2) the detection results of FNAC+FNA-Tg were presented; (3) data such as true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) were provided or could be calculated from sensitivity (Se), specificity (Sp), accuracy, and so on.
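Criterion (3) allows the 2×2 table to be back-calculated when a study reports only Se and Sp together with the numbers of malignant and benign LNs. The following is a minimal sketch of that arithmetic, not the procedure actually used by the reviewers; the function name `reconstruct_2x2` and the example figures are hypothetical.

```python
def reconstruct_2x2(se, sp, n_malignant, n_benign):
    """Back-calculate TP, FP, TN, FN from reported Se, Sp and group sizes.

    Assumption: Se and Sp were computed on exactly these group sizes, so
    rounding to the nearest integer recovers the original counts.
    """
    tp = round(se * n_malignant)   # malignant LNs correctly called positive
    fn = n_malignant - tp          # malignant LNs missed by the test
    tn = round(sp * n_benign)      # benign LNs correctly called negative
    fp = n_benign - tn             # benign LNs wrongly called positive
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical study: 100 malignant and 80 benign LNs, Se 0.90, Sp 0.95
table = reconstruct_2x2(0.90, 0.95, 100, 80)
print(table)
```

Because reported Se and Sp are usually rounded, a reconstructed table should be checked against any other figure the study reports (for example, accuracy) before it is pooled.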

Data extraction and quality assessment

Two reviewers independently extracted the following data: the first author’s name, year of publication, study design, sample size of cases and controls, pathological type of disease, preoperative or postoperative diagnosis of LN metastases, cut-offs of FNA-Tg, Se, Sp, TP, FP, TN, and FN of FNAC, FNA-Tg, and FNAC+FNA-Tg. The included articles were evaluated item-by-item according to the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) checklist [14]. Risks of bias in regard to four key domains including patient selection, index test, reference standard, and the flow and timing were evaluated, and the first three domains were applied to assess applicability concerns. If a study is judged as “low risk” on all domains, then it should be considered as “low risk of bias” or “low concern regarding applicability.” If a study is classified as “high risk” or “unclear” in one or more domains, then it should be judged as “at risk of bias” or having “concerns regarding applicability.” Disagreements generated in this process were resolved by consensus.

Statistical analysis

The threshold effect was evaluated by inspecting the shape of the ROC plane plot and by calculating Spearman’s correlation coefficient between the logit of Se and the logit of (1 − Sp); a p value for Spearman’s correlation coefficient (pa) < 0.05 indicated the existence of a threshold effect. The inconsistency index I-square (I2) and Cochran’s Q were calculated to check for heterogeneity caused by non-threshold factors, and I2 > 50% was regarded as substantial heterogeneity. The positive likelihood ratio (PLR, Se/(1 − Sp)) is the ratio of the true positive rate to the false positive rate, and the negative likelihood ratio (NLR, (1 − Se)/Sp) is the ratio of the false negative rate to the true negative rate; they act as relatively independent estimators of how much a test result will change the odds of having a disease [15]. The SROC curve, with its corresponding area under the curve (AUC) and Q value, is a suitable way to evaluate the stability and accuracy of a test. The Q value is the point at which the SROC curve intersects the diagonal running from top left to bottom right of the ROC plane, where Se equals Sp; it corresponds to the highest common value of Se and Sp for the test. A p value for the difference between the b value (the slope of the Moses-Shapiro-Littenberg regression) and zero (pb) < 0.05 indicated that an asymmetrical SROC should be plotted; otherwise, a symmetrical SROC was applied. The DOR, which reflects the relationship between the result of the diagnostic test and the disease, was computed using the Moses-Shapiro-Littenberg model for symmetrical or asymmetrical SROC. All statistical outcomes and plots were produced using Meta-DiSc 1.4 software.
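The per-study indices defined above follow directly from a 2×2 table. The sketch below is illustrative only and is not a reproduction of Meta-DiSc; the function name `diagnostic_indices`, the example counts, and the 0.5 continuity correction for zero cells are assumptions made here.

```python
import math


def diagnostic_indices(tp, fp, tn, fn):
    """Return Se, Sp, PLR, NLR and DOR for a single 2x2 table."""
    # Assumption: add 0.5 to every cell when any cell is zero, a common
    # continuity correction that keeps the ratios finite.
    if 0 in (tp, fp, tn, fn):
        tp, fp, tn, fn = (x + 0.5 for x in (tp, fp, tn, fn))
    se = tp / (tp + fn)            # true positive rate
    sp = tn / (tn + fp)            # true negative rate
    plr = se / (1 - sp)            # PLR = Se / (1 - Sp)
    nlr = (1 - se) / sp            # NLR = (1 - Se) / Sp
    dor = (tp * tn) / (fp * fn)    # DOR, equal to PLR / NLR
    return {"Se": se, "Sp": sp, "PLR": plr, "NLR": nlr, "DOR": dor}


# Hypothetical study: 90 TP, 4 FP, 76 TN, 10 FN
example = diagnostic_indices(tp=90, fp=4, tn=76, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

The DOR is computed from the cell counts rather than as PLR/NLR to avoid compounding floating-point error; the two forms are algebraically identical.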

Results

Search results and characteristics of included studies

The flow chart summarizing the study screening process is shown in Fig. 1. The initial search yielded a total of 407 records. Of the 350 records left after removal of duplicates, 281 unrelated records were excluded on the basis of titles and abstracts. The full texts of the remaining 69 articles were then read carefully and screened according to our predefined inclusion criteria, and 48 articles were discarded because of the absence of data to calculate diagnostic parameters (n = 22), no FNAC+FNA-Tg combined diagnostic data (n = 10), insufficient sample size (n = 9), lack of pathological diagnosis (n = 5), or no reported FNA-Tg cut-off value (n = 2). Finally, 17 English [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] and 4 Chinese [33,34,35,36] articles, comprising 1662 malignant and 1279 benign LNs from 2712 patients with DTC, were included in our meta-analysis.

Fig. 1

Flow chart of study selection and inclusion in the meta-analysis. Excluded PubMed (single asterisk symbol); other databases searched: Google Scholar, Baidu Scholar, Global Health (excluded PubMed) (double asterisk symbols)

Among them, 8 studies examined LNs from preoperative patients, 7 studies included LNs from postoperative patients, 2 studies involved both preoperative and postoperative patients, and 4 studies did not clarify whether the patients had undergone thyroidectomy. Different detection methods for FNA-Tg were used: 8, 5, 4, 1, and 1 studies used IRMA, ICMA, ECLIA, IMA, and IFMA, respectively, and the remaining 2 studies did not describe the method used. Only about half of the included studies reported the analytical Se and functional Se of the detection method for FNA-Tg. Three studies used the FNA-Tg/serum-Tg ratio as the positive standard for the FNA-Tg test, while the others used the FNA-Tg level. The characteristics of the included studies are summarized in Table 1.

Table 1 Characteristics of included studies

Study quality evaluated by QUADAS-2

According to the QUADAS-2 tool, patient selection in all the studies satisfied the corresponding signaling questions and was therefore judged to be at low risk of bias and of low concern regarding applicability. However, several studies might introduce a high or unclear risk of bias in the index test and flow and timing domains, because they did not interpret the diagnostic results of FNAC in a blinded manner or did not pre-establish a diagnostic threshold for FNA-Tg interpretation. In addition, some of the included studies employed different FNA-Tg cut-offs or set a cut-off value after summarizing all the detection data, which might raise concerns regarding the applicability of the index test (Fig. 2).

Fig. 2

Quality evaluation of the included studies according to the QUADAS-2 guideline. a Proportion of studies with low, high, or unclear risk of bias and concerns regarding applicability. b Heatmap showing the detailed assessment of each included study in each key domain of risk of bias and applicability concerns. The included studies are referred to by author name and publication year and are listed in alphabetical order of author names

Heterogeneity evaluation

As can be seen from the ROC plane plots of the total group, no typical “shoulder arm” pattern was present for FNAC, FNA-Tg, or FNAC+FNA-Tg (Fig. 3). All pa values were > 0.05 in both the preoperative and the postoperative subgroups, suggesting that there was no statistically significant threshold effect among the included studies (Table 2). The pb values in the preoperative and postoperative subgroups were all > 0.05, so symmetrical SROC curves were plotted, whereas the pb values in the total group were all < 0.05 (Table 2), so asymmetrical SROC curves were fitted for FNAC, FNA-Tg, and FNAC+FNA-Tg.
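The two checks reported above (Spearman’s correlation for the threshold effect and the slope b for SROC symmetry) can be sketched as follows. This is a simplified illustration under stated assumptions, not the Meta-DiSc implementation: the regression is an unweighted least-squares fit of D = a + bS, no p values are computed, and the function names and the (Se, Sp) pairs are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_threshold(studies):
    """Spearman's rho between logit(Se) and logit(1 - Sp) across studies."""
    logit_tpr = [logit(se) for se, sp in studies]
    logit_fpr = [logit(1 - sp) for se, sp in studies]
    return pearson(ranks(logit_tpr), ranks(logit_fpr))


def moses_littenberg(studies):
    """Unweighted fit of D = a + b*S, with D = ln(DOR) and S a threshold proxy."""
    d = [logit(se) - logit(1 - sp) for se, sp in studies]
    s = [logit(se) + logit(1 - sp) for se, sp in studies]
    ms, md = sum(s) / len(s), sum(d) / len(d)
    b = sum((x - ms) * (y - md) for x, y in zip(s, d)) / sum((x - ms) ** 2 for x in s)
    a = md - b * ms
    return a, b


def q_star(a):
    """Point where Se = Sp on a symmetrical SROC curve (b taken as 0)."""
    return 1 / (1 + math.exp(-a / 2))


# Hypothetical (Se, Sp) pairs, one per study
studies = [(0.80, 0.98), (0.85, 0.95), (0.90, 0.97), (0.75, 0.99), (0.88, 0.93)]
rho = spearman_threshold(studies)
a, b = moses_littenberg(studies)
print(f"rho={rho:.3f}  a={a:.3f}  b={b:.3f}  Q*={q_star(a):.3f}")
```

A strong positive rho suggests that studies trade Se against Sp by moving the cut-off, whereas a slope b close to zero means the DOR does not depend on the threshold and a symmetrical curve is adequate.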

Fig. 3

ROC plane plots for FNAC, FNA-Tg, and FNAC+FNA-Tg

Table 2 Statistical results of Spearman’s correlation coefficient

For the non-threshold effect, I2 and Cochran’s Q for DOR were evaluated. In the total group, statistical heterogeneity existed among the studies of FNAC (I2 = 62.4%, p for Cochran’s Q = 0.0001) and FNA-Tg (I2 = 63.5%, p for Cochran’s Q = 0.0001) but not among those of FNAC+FNA-Tg (I2 = 26.24%, p for Cochran’s Q = 0.1581) (Table 3). To deal with the heterogeneity and evaluate the robustness of our results, we used a random effects model to calculate the pooled results, reanalyzed the data after removing studies that visibly deviated from the rest, and performed subgroup analyses to explore the sources of heterogeneity.
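For readers unfamiliar with these statistics, the sketch below shows how Cochran’s Q, I2, and a random effects pooled DOR can be obtained from per-study 2×2 tables. It assumes inverse-variance weighting on the log DOR scale and the DerSimonian-Laird estimator of between-study variance; the function names and the example tables are hypothetical and are not data from the included studies.

```python
import math


def log_dor(tp, fp, tn, fn):
    """ln(DOR) and its variance for one 2x2 table."""
    # Assumption: 0.5 continuity correction when any cell is zero.
    if 0 in (tp, fp, tn, fn):
        tp, fp, tn, fn = (x + 0.5 for x in (tp, fp, tn, fn))
    theta = math.log((tp * tn) / (fp * fn))
    variance = 1 / tp + 1 / fp + 1 / tn + 1 / fn
    return theta, variance


def heterogeneity(effects):
    """Cochran's Q, its degrees of freedom, and I2 (%) for (theta, variance) pairs."""
    weights = [1 / v for _, v in effects]
    fixed = sum(w * t for w, (t, _) in zip(weights, effects)) / sum(weights)
    q = sum(w * (t - fixed) ** 2 for w, (t, _) in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2


def random_effects_pooled(effects):
    """DerSimonian-Laird pooled estimate and between-study variance tau^2."""
    q, df, _ = heterogeneity(effects)
    weights = [1 / v for _, v in effects]
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1 / (v + tau2) for _, v in effects]
    pooled = sum(w * t for w, (t, _) in zip(re_weights, effects)) / sum(re_weights)
    return pooled, tau2


# Hypothetical (TP, FP, TN, FN) tables, one per study
tables = [(45, 2, 60, 8), (30, 1, 40, 3), (52, 6, 70, 12), (18, 0, 25, 4)]
effects = [log_dor(*t) for t in tables]
q, df, i2 = heterogeneity(effects)
pooled, tau2 = random_effects_pooled(effects)
print(f"Q={q:.2f} (df={df}), I2={i2:.1f}%, pooled DOR={math.exp(pooled):.1f}")
```

I2 expresses the share of the total variation in ln(DOR) that exceeds what chance alone would produce, which is why the threshold of 50% is applied to I2 rather than to Q itself.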

Table 3 The diagnostic performance of FNAC, FNA-Tg, and FNAC+FNA-Tg

Diagnostic performances of FNAC, FNA-Tg, and FNAC+FNA-Tg in the total group

FNAC

The summary Se and Sp with 95% CI were 0.82 (0.80–0.84) and 0.98 (0.97–0.99), respectively, indicating that FNAC is highly specific for LNM but may produce a certain proportion of false negative results. The DOR of FNAC in diagnosing LNM was 183.75, and the AUC of the SROC curve was 0.9353 (Figs. 4, 5, and 6; Table 3). We noticed that two studies (Cunha 2007 and Holmes 2014) had an extremely high Se of FNAC (Fig. 4), and another two studies (Al-Hilli 2016 and Sigstad 2007) presented an obviously low Sp of FNAC (Fig. 5). To evaluate the stability of the results, we removed each of these four studies in turn and found that the pooled Se and Sp both remained the same. When the Al-Hilli 2016 study was removed, the DOR and AUC increased to 217.26 and 0.9644; I2 for DOR decreased slightly but was still > 50% (from 62.4 to 58.0%). The DOR, AUC, and I2 for DOR did not change much when any of the other three studies was removed.

Fig. 4

Combined Se for FNAC, FNA-Tg, and FNAC+FNA-Tg

Fig. 5

Pooled Sp for FNAC, FNA-Tg, and FNAC+FNA-Tg

Fig. 6

SROC curves with AUC and Q value for FNAC, FNA-Tg, and FNAC+FNA-Tg

FNA-Tg

The combined Se, Sp, and DOR with 95% CI of FNA-Tg in recognizing LNM were 0.93 (0.92–0.95), 0.92 (0.91–0.93), and 155.17 (86.39–278.74), respectively, and the corresponding AUC of the SROC curve was 0.9674 (Figs. 4, 5, and 6; Table 3), indicating that FNA-Tg was more sensitive but less specific than FNAC. As shown in Fig. 5, four studies (Al-Hilli 2016, Kim 2009, Shi 2015, and Yap 2014) reported a much lower Sp of FNA-Tg than the others. We removed these studies one by one and found that the Sp, DOR, and I2 for DOR fluctuated only slightly, suggesting that the meta-analysis results for FNA-Tg were stable.

FNAC+FNA-Tg

The pooled Se and Sp with 95% CI of FNAC+FNA-Tg were 0.97 (0.96–0.98) and 0.94 (0.93–0.95), respectively. The DOR and the AUC of the SROC curve of FNAC+FNA-Tg for identifying LNM were 446.00 and 0.9862 (Figs. 4, 5, and 6; Table 3), higher than those of either FNAC or FNA-Tg alone. Four studies (Al-Hilli 2016, Kim 2009, Sigstad 2007, and Zhang 2014) obviously deviated from the rest in Sp (Fig. 5). When the Kim 2009 study was removed, we observed an increase in DOR (from 446.00 to 513.61) and AUC (from 0.9862 to 0.9913) and a decrease in I2 for DOR (from 23.8 to 0%). The one-by-one removal of the other three studies did not alter the Sp or the DOR greatly (Sp fluctuated from 0.94 to 0.95, DOR ranged from 459.49 to 501.01).

Diagnostic performances in preoperative and postoperative subgroups

A test may perform differently in different clinical situations. As can be seen from Table 4, the diagnostic efficacy of FNAC was higher in the postoperative subset (DOR 141.82, AUC 0.9560) than in the preoperative subset (DOR 86.34, AUC 0.9488), while FNA-Tg performed slightly better in the preoperative subgroup (DOR 94.10, AUC 0.9727) than in the postoperative subgroup (DOR 91.37, AUC 0.9510). Notably, both before and after operation, the combination of FNAC and FNA-Tg performed better than either test alone, and the best diagnostic performance was achieved by FNAC+FNA-Tg applied postoperatively to monitor recurrence (DOR 788.72, AUC 0.9930). The heterogeneity for FNAC+FNA-Tg was much lower than that for FNAC or FNA-Tg alone, both preoperatively and postoperatively. The heterogeneity was high in the preoperative subgroup for FNAC and FNA-Tg (I2 > 50% and p for Cochran’s Q < 0.05), indicating that non-threshold effects might affect the results.

Table 4 The diagnostic performance in preoperative and postoperative subgroups

Subgroup analysis based on different cut-offs of FNA-Tg

The included studies employed different numerical values of FNA-Tg, or the combination of FNA-Tg and serum-Tg, as the cut-off in Tg measurement, which might influence the Se and Sp of each individual study, thus introducing heterogeneity and decreasing the diagnostic performance of the pooled analysis. To explore the influence of different positive standards on the diagnostic performance of FNA-Tg and FNAC+FNA-Tg, we divided these studies into four subgroups: 0 < cut-off < 1.0 ng/ml, cut-off = 1.0 ng/ml, cut-off > 1.0 ng/ml, and FNA-Tg/serum-Tg ratio > 1. The numbers of studies that fell into these subgroups were 7, 6, 5, and 3, respectively. Regardless of which cut-off was used, the combination of FNAC and FNA-Tg performed better than either test alone (Table 5). The combination of FNAC and FNA-Tg/serum-Tg ratio achieved a much higher pooled Se, Sp, and DOR (Se, 0.91; Sp, 0.98; and DOR, 448.05) than the combination of FNAC and FNA-Tg. The best Se (0.99) and AUC (0.9949) of FNAC+FNA-Tg were reached when using cut-offs < 1.0 ng/ml, while the highest Sp of FNAC+FNA-Tg was obtained when using FNA-Tg/serum-Tg ratio > 1 as the cut-off (Table 5).

Table 5 The subgroup analysis of FNA-Tg based on different cut-offs

Discussion

This study summarized the diagnostic performance of US-guided FNAC, FNA-Tg, and the combination of both for nodal metastasis assessment in preoperative and postoperative patients with DTC. Integrating the results of the included studies, we found that FNAC had a better Sp while FNA-Tg had a higher Se, and that the two tests achieved a better diagnostic performance when used jointly than independently. FNAC and its combination with FNA-Tg performed better when applied to detect recurrence postoperatively, while FNA-Tg performed better when used to confirm staging preoperatively. On the whole, the combination of FNAC and FNA-Tg/serum-Tg ratio achieved a better diagnostic performance than the combination of FNAC and FNA-Tg. However, different categories of cut-off values may be suitable for different clinical situations.

A complete LN dissection is the key step of initial treatment for DTC. To reduce the number of prophylactic bilateral neck LN dissections and to build a personalized therapeutic strategy, surgeons should be well informed of the metastatic status of the neck LNs of each patient [37]. After initial treatment consisting of surgery with or without radioactive iodine, timely recognition of recurrence in LNs allows timely intervention. FNAC is the most direct way to examine LNM before or after operation; it is highly specific because pathologists do not make a positive diagnosis until they observe definite tumor cells under the microscope. However, the Se of FNAC was only 0.82, as suggested by the pooled analysis in the current study (Table 3). Notably, the quality of the aspirated material depends largely on US guidance to avoid targets lacking cellular components, such as cystic parts or necrotic areas [36]. Although the interpretation of conventional cytology depends to some extent on the subjective experience of pathologists, it can achieve a more robust diagnostic efficacy when combined with other tests.

Since it was first reported in 1992 [38], the Se of the Tg level in needle washout specimens for the diagnosis of LNM has been well documented. Tg is produced only by thyroid follicular cells and is involved in thyroid hormone synthesis and iodine transport [7]. Like other auxiliary diagnostic examinations, such as BRAF mutation testing in cytology samples [39], the FNA-Tg test does not cause patients additional discomfort because it is an adjunct to FNA biopsy. Its interpretation is less dependent on personal experience than cytology. FNA-Tg could enhance the diagnostic accuracy of FNAC by decreasing false negative results at a modest cost [16]. Our analysis also indicated that the combination of FNAC and FNA-Tg could increase the pooled Se by 0.15 compared with FNAC alone (from 0.82 to 0.97; Table 3). The DOR values of FNAC, FNA-Tg, and FNAC+FNA-Tg were 183.75, 155.17, and 446.00 in the total group, respectively, indicating that the combination of FNAC and FNA-Tg has a stronger diagnostic power in discriminating LNM of DTC.

A quantitative measurement of Tg in LN aspirate samples is an objective indicator of DTC metastasis before or after thyroidectomy. However, no consensus has been reached worldwide on the diagnostic threshold value of FNA-Tg [26]. Some researchers have also noted that Tg values close to the cut-off should be interpreted with caution: they may represent a minute metastatic focus, or, conversely, Tg values measured before surgery might represent a false positive due to contamination with blood [30]. To avoid interference from blood Tg, some studies suggested using the FNA-Tg/serum-Tg ratio as the positive standard instead [33, 36]. Our subgroup analysis showed that, as a single test, FNA-Tg performed less accurately with the FNA-Tg/serum-Tg ratio than with the Tg level as the cut-off, but that it performed very well when combined with FNAC using the FNA-Tg/serum-Tg ratio as the cut-off, especially in Sp (AUC 0.9860 and Sp 0.98). Regardless of which FNA-Tg cut-off was employed, diagnostic indices such as Se, PLR, NLR, DOR, AUC, and Q value were generally better for the combination of FNA-Tg and FNAC than for FNA-Tg alone, again stressing the importance of combining FNA-Tg with FNAC in LNM detection in patients with DTC.

In a meta-analysis, intra-study variation and inter-study variation are the main causes of heterogeneity. According to our results, no threshold effect was detected among the included studies; however, the non-threshold effect evaluated by the I2 and Cochran’s Q statistics was significant. Therefore, we adopted several measures to reduce the influence of heterogeneity on the overall conclusion: using a random effects model, reanalyzing the data after removing deviating studies, and conducting subgroup analyses. It should be noted that the test combination should be chosen according to the diagnostic purpose. For example, a combination with high Se, such as FNAC+FNA-Tg with an FNA-Tg cut-off between 0 and 1.0 ng/ml, is suitable for screening in a large-scale population; a combination with high Sp, such as FNAC+FNA-Tg with FNA-Tg/serum-Tg > 1 as the cut-off, is appropriate to assist in making a surgical plan. To avoid the risk of false negative results, close follow-up should be provided when a less sensitive test, such as FNAC, is applied alone preoperatively or postoperatively.
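The trade-off between a highly sensitive and a highly specific test can be made concrete with Bayes’ rule. The sketch below uses the pooled Se and Sp reported in Table 3 for FNAC (0.82, 0.98) and FNAC+FNA-Tg (0.97, 0.94); the pre-test probability of 0.30 is an assumption chosen from within the 20 to 50% range of LN metastasis quoted in the Introduction, and the function name is hypothetical.

```python
def post_test_probabilities(se, sp, pretest):
    """Probability of LNM after a positive and after a negative test result."""
    # After a positive result (positive predictive value)
    after_positive = se * pretest / (se * pretest + (1 - sp) * (1 - pretest))
    # After a negative result (residual risk, i.e. 1 - negative predictive value)
    after_negative = (1 - se) * pretest / ((1 - se) * pretest + sp * (1 - pretest))
    return after_positive, after_negative


PRETEST = 0.30  # assumed pre-test probability of LNM, for illustration only

tests = {
    "FNAC": (0.82, 0.98),          # pooled Se, Sp from Table 3
    "FNAC+FNA-Tg": (0.97, 0.94),   # pooled Se, Sp from Table 3
}
for name, (se, sp) in tests.items():
    pos, neg = post_test_probabilities(se, sp, PRETEST)
    print(f"{name}: P(LNM | positive)={pos:.3f}, P(LNM | negative)={neg:.3f}")
```

Under this assumed pre-test probability, the more sensitive combination leaves a smaller residual risk after a negative result, whereas the more specific test gives greater certainty after a positive result; the actual values shift with the pre-test probability of the population tested.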

Our study has several strengths and limitations. We included a large number of LNs and patients to improve the robustness of our analysis, searched the gray literature, and consulted full texts available in languages other than English, which has increased the general applicability of our conclusions. Moreover, subgroup analyses were carefully conducted to explore the influence of patient status and FNA-Tg cut-offs on diagnostic performance. Admittedly, the number of included studies, especially those using the FNA-Tg/serum-Tg ratio as the cut-off in Tg detection, was relatively limited. Additionally, a significant non-threshold effect existed in the incorporated data. Further subgroup analysis suggested that different FNA-Tg cut-offs could add to the inter-study heterogeneity, but this did not greatly affect the stability of the pooled outcomes, and the data still supported the conclusions on the whole.

Conclusion

In summary, FNA-Tg was more sensitive but less specific than FNAC in identifying LNM of DTC. The addition of FNA-Tg, especially the FNA-Tg/serum-Tg ratio, to FNAC could improve the diagnostic accuracy for LNM in patients with DTC and provide more comprehensive information for preoperative staging and postoperative follow-up.