Introduction

Accurate diagnosis informs adequate treatment in clinical practice. Understanding the accuracy of diagnostic tests allows clinicians to discern between futile tests and those that can significantly advance the diagnostic process. Yet, clinicians often have inaccurate expectations regarding the benefits and harms of medical tests [1].

Despite the clinical importance of diagnostic tests, research on diagnosis usually lags behind research on treatment in both quantity and quality [2, 3]. An analysis of research in the field of endocrinology suggests that knowledge gaps (including those related to diagnosis) are not adequately addressed by ongoing research studies [4]. The Institute of Medicine’s (IOM) report on diagnostic errors recognizes this knowledge gap and highlights the need for the research community to improve the understanding of the diagnostic process [2].

Systematic reviews (SRs), summarizing the body of evidence, provide clinicians, patients, guideline panelists, and policymakers with best estimates of the accuracy of diagnostic tests and their usefulness in clinical practice. They allow clinicians to evaluate the body of available evidence instead of using only the latest, largest, or most well-known study to inform their practice. In order for clinicians to apply the results in practice, SRs should: (1) produce results that are clinically useful and (2) have credible and reproducible methods for synthesizing and reporting the evidence. Our logic model (Fig. 1) evaluates the presence of criteria that can help clinicians evaluate SRs in general (e.g., performing an assessment of risk of bias) and diagnostic reviews in particular (e.g., including only patients with diagnostic uncertainty) [5, 6].

Fig. 1

Logic model

In accordance with IOM’s recommendations, an assessment of the literature can help identify useful diagnostic tests in endocrinology and the conditions that warrant investment of scarce research resources to improve the diagnostic process. To this end, we identified all SRs and meta-analyses addressing the diagnostic accuracy of different tests in the field of endocrinology. Here, we summarize their findings along with the quality and reporting of the methods used in order to determine if the diagnostic evidence syntheses in endocrinology are: (a) clinically useful and (b) credible.

Methods

Eligibility criteria

We included SRs and meta-analyses reporting accuracy measures of diagnostic tests used in patients under evaluation for any endocrine condition. Endocrine conditions were defined as those that belonged to one of the following categories: (1) bone, (2) diabetes/glycemia, (3) neuroendocrine tumors, (4) pituitary–gonadal–adrenal, and (5) thyroid. Eligible studies reported clinically applicable diagnostic accuracy measures using: sensitivity, specificity, positive predictive value, negative predictive value, positive likelihood ratio, negative likelihood ratio, diagnostic odds ratio, diagnostic accuracy, or the area under the curve. Studies not reporting any of these diagnostic measures were excluded without author contact.

Search methods for identification of studies

A comprehensive search of MEDLINE, EMBASE, and Cochrane CENTRAL from inception to December 2015 without language restrictions was designed by an experienced medical librarian (P.J.E.) with input from the study’s principal investigators (N.S.O. and R.R.G.). Controlled vocabulary supplemented with keywords was used to search for SRs of diagnostic tests in endocrinology (Online Resource 1). We consulted Mayo Clinic experts in the field to identify references that could have been missed by our search.

Selection of studies and data management

The search results were uploaded into SR software (DistillerSR, Ottawa, Canada, https://distillercer.com/products/distillersr-systematic-review-software/). Six reviewers, working independently and in duplicate, reviewed all abstracts and titles for inclusion (J.P.B., N.I.A., R.R.G., N.S.O., G.S.B., and S.T.) and assessed the eligibility of full-text publications retrieved if at least one reviewer considered the abstract eligible. Disagreements were resolved by arbitration (a third reviewer decided whether the study should be included). Chance-adjusted agreement was quantified using the kappa statistic [7].
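As an illustration of chance-adjusted agreement, the following minimal sketch computes Cohen's kappa for two reviewers' include/exclude decisions. It assumes the two-rater form of the statistic; the function name and the example decisions are hypothetical and are not taken from the study data.

```python
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-adjusted agreement between two raters (Cohen's kappa)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal proportions.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(label) / n) * (list(rater_b).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions (1 = include, 0 = exclude).
reviewer_1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
reviewer_2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

In this hypothetical example the reviewers agree on 8 of 10 items, but because 58% agreement would be expected by chance alone, kappa is considerably lower than the raw agreement.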

Data collection

Reviewers performed data collection independently and in duplicate using a standardized form and data extraction instructions after calibrating with two included full texts. Extracted data included: (a) general information about the review (first author, year of publication, country, condition of interest, diagnostic test, gold standard used, prevalence of the disease in the studied population, number of articles and patients included); (b) clinical performance of the diagnostic test (pooled summary statistics); and (c) methods and data reporting (search strategy, language restrictions, review of references, independent duplicate process, clear clinical question, selection criteria, summary of included studies, predefined subgroup analyses, methods used for data representation, method used for analysis, evaluation of risk of bias and tool used, author contact description, publication bias assessment, heterogeneity assessment, and assessment of the confidence warranted by the evidence).

In order to determine the clinical usefulness of diagnostic tests, subgroup analyses were performed including only studies in which a likelihood ratio (a measure of diagnostic accuracy) was provided or could be calculated from the available data (e.g., when sensitivity and specificity were provided). We chose likelihood ratios for this analysis since this diagnostic variable allows comparison of results across studies, can be applied regardless of the prevalence of a condition, and is easily understood and applied by physicians when expressed in non-technical language [8]. In cases in which a single diagnostic test had more than one review available, we included the study with the best clinical performance: a best-case scenario analysis.
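The calculation of likelihood ratios from a reported sensitivity and specificity can be sketched as follows. This is a minimal illustration using the standard definitions; the function name and the example values are hypothetical, and ratios are returned as `None` when they are undefined.

```python
from typing import Optional, Tuple


def likelihood_ratios(
    sensitivity: float, specificity: float
) -> Tuple[Optional[float], Optional[float]]:
    """Return (LR+, LR-) from sensitivity and specificity given as proportions."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie between 0 and 1")
    # LR+ = sensitivity / (1 - specificity); undefined when specificity is 1.
    lr_pos = sensitivity / (1.0 - specificity) if specificity < 1.0 else None
    # LR- = (1 - sensitivity) / specificity; undefined when specificity is 0.
    lr_neg = (1.0 - sensitivity) / specificity if specificity > 0.0 else None
    return lr_pos, lr_neg


# Hypothetical pooled estimates: sensitivity 90%, specificity 95%.
lr_pos, lr_neg = likelihood_ratios(0.90, 0.95)
```

With these hypothetical inputs, LR+ is approximately 18 and LR− approximately 0.11.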

We used descriptive statistics to summarize extracted variables. Authors’ descriptions of “risk of bias” and “quality” were considered to be reflecting the same construct [9]. Author ratings of studies as “high” or “moderate to high” quality were coded as low-risk of bias and “low” or “acceptable” quality ratings were coded as high-risk of bias. We considered assessments of “confidence in the evidence” to be only those that used the results of the review to assess the quality of the body of evidence. Bivariate and hierarchical models were considered to be adequate methods for statistical analyses [10]. When a description of methods for statistical analysis was provided, reviews that did not use these methods were coded as “other” if another method (e.g., Moses–Littenberg random effects) was used or “unclear” when the description was insufficient to ascertain the method used.
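The coding rule for author quality ratings described above can be expressed as a simple lookup. This is only a sketch: the function name, the text normalization, and the "not reported" and "unclear" fallback labels are assumptions made for illustration, not part of the extraction form.

```python
from typing import Optional

# Mapping stated in the text: author quality rating -> coded risk of bias.
QUALITY_TO_RISK_OF_BIAS = {
    "high": "low",
    "moderate to high": "low",
    "low": "high",
    "acceptable": "high",
}


def code_risk_of_bias(author_rating: Optional[str]) -> str:
    """Code an author's quality rating as a risk-of-bias category."""
    if author_rating is None:
        return "not reported"  # assumed label for a missing rating
    # Normalize case and whitespace before the lookup (an assumption).
    key = " ".join(author_rating.lower().split())
    # Ratings outside the four named categories get an assumed fallback.
    return QUALITY_TO_RISK_OF_BIAS.get(key, "unclear")


coded = [code_risk_of_bias(r) for r in ["High", "acceptable", None, "Moderate  to high"]]
```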

Results

The database search generated 2122 eligible reports. After abstract and title screening, 115 reports were identified for full-text review. After reproducible full-text screening (κ statistic = 0.89) [7], 84 studies were found to be eligible [11–95] (Fig. 2). The majority of the included studies evaluated diagnostic tests related to thyroid conditions (n = 42, 50%), most commonly thyroid cancer, or diabetes/glycemia (n = 19, 23%). Five SRs did not include a meta-analysis. General characteristics of the included studies are reported in Table 1.

Fig. 2

PRISMA flowchart (study selection process)

Table 1 Summary of included systematic reviews

Forty-four of the 84 included reviews, evaluating 65 diagnostic tests, provided a pooled estimate of the likelihood ratio or enough data to calculate it. The positive likelihood ratio (+LR) of these tests is shown in Fig. 3 and the negative likelihood ratio (−LR) in Fig. 4. Forty-seven percent of the 65 diagnostic tests had a positive likelihood ratio of ≥10. Only 35% of the studies reporting these helpful +LRs, however, provided an overall statement of the risk of bias of the included studies. Online Resource 2 shows a summary of diagnostic tests with +LR ≥ 10, considered to be most useful to “rule-in” a diagnosis in endocrinology.

Fig. 3

Positive likelihood ratios of the included studies. Y-axis, LR+ (positive likelihood ratio). Risk of bias: blue: low to moderate; red: high; black: not reported. NeT: Neuroendocrine tumors. Diagnostic tests included: (1) MET-PET for localization of parathyroid adenoma. (2) MIBG for localization in secondary hyperparathyroidism. (3) Calcaneal quantitative ultrasound (T-score threshold − 2.5) for osteoporosis. (4) Anti-mullerian hormone for polycystic ovarian syndrome. (5) FDG PET for malignant adrenal mass. (6) Late night salivary cortisol for Cushing’s syndrome. (7) Urinary free cortisol for Cushing’s syndrome. (8) 2 mg dexamethasone suppression test for Cushing’s syndrome. (9) Urinary free cortisol plus dexamethasone suppression test for Cushing’s syndrome. (10) Urinary free cortisol plus salivary midnight cortisol for Cushing’s syndrome. (11) Urinary free cortisol plus salivary midnight cortisol plus dexamethasone suppression test (all tests) for Cushing’s syndrome. (12) IGF-1 for growth hormone deficiency. (13) IGFBP-3 for growth hormone deficiency. (14) Growth hormone releasing peptide 6 for growth hormone deficiency. (15) Serum growth hormone levels for growth hormone deficiency. (16) Glucagon stimulation test for growth hormone deficiency. (17) ITT for growth hormone deficiency. (18) 250 μg cosyntropin stimulation test for primary adrenal insufficiency. (19) 250 μg cosyntropin stimulation test for secondary adrenal insufficiency. (20) 123I-MIBG for localization of pheochromocytoma. (21) Endoscopic ultrasound for pancreatic neuroendocrine tumors. (22) Chromogranin A for neuroendocrine tumors. (23) Gallium-68 somatostatin receptor PET and PET/CT for thoracic and gastroenteropancreatic neuroendocrine tumors. (24) 18F DOPA PET or PET/CT for thoracic and gastroenteropancreatic neuroendocrine tumor. (25) 18F DOPA PET or PET/CT for pheochromocytoma/paraganglioma. (26) Telemedicine for diabetic retinopathy (absence). (27) Telemedicine for diabetic macular edema. 
(28) Hemoglobin A1c for postpartum abnormal glucose tolerance. (29) HbA1c+ 2 SD for diabetes. (30) Albumin urine concentration for microalbuminuria. (31) Ratio of albumin to creatinine for microalbuminuria. (32) Plaster for diabetic neuropathy. (33) Hemoglobin A1C for gestational diabetes. (34) QTc prolongation for diabetic autonomic failure. (35) MICRAL dipstick for diabetic microalbuminuria. (36) Probe to bone test for osteomyelitis in patients with diabetes. (37) Plain radiography for osteomyelitis in patients with diabetes. (38) MRI for osteomyelitis in patients with diabetes. (39) Bone scan Tc for osteomyelitis in patients with diabetes. (40) Bone scan for osteomyelitis in patients with diabetes. (41) Optical coherence tomography for diabetic macular edema. (42) 18F-dihydroxyphenylalanine positron emission tomography for congenital hyperinsulinism. (43) Pancreatic venous sampling for focal congenital hyperinsulinism. (44) 18F-DOPA PET for focal congenital hyperinsulinism. (45) Real-time elastography for thyroid cancer. (46) Core needle biopsy for thyroid malignancy. (47) Bethesda system for reporting thyroid cytopathology (only malignant and benign) for thyroid cancer. (48) Frozen section for follicular lesions, thyroid cancer. (49) Frozen section for non-follicular lesions (thyroid cancer). (50) Frozen section for thyroid lesions, not otherwise specified (thyroid cancer). (51) Contrast-enhanced ultrasound for thyroid cancer. (52) MicroRNA for thyroid cancer. (53) Thyroid ultrasound features (taller than wide) for thyroid cancer. (54) Thyroid US features spongiform for thyroid cancer. (55) Standardized uptake value (SUVmax) for thyroid cancer. (56) Thyroglobulin washout for thyroid cancer. (57) Diffusion weighted MR imaging for thyroid cancer. (58) Serum thyroglobulin with immunometric assay (with ablation, during thyroxine) for thyroid cancer recurrence. (59) B-RAF V600E for thyroid cancer. (60) 99mTC methoxyisobutylisonitrile scintigraphy for thyroid cancer. 
(61) FDG PET for thyroid cancer. (62) GAL-3 for thyroid cancer. (63) HBME-1 for thyroid cancer. (64) 3G TRAB assay for Graves’ disease. (65) Ultrasonography for cervical lymph node metastases in papillary thyroid cancer

Fig. 4

Negative likelihood ratios of the included studies. Y-axis: LR− (negative likelihood ratio). Risk of bias: blue: low to moderate; red: high; black: not reported. NeT: Neuroendocrine tumors. Diagnostic tests included: (1) MET-PET for localization of parathyroid adenoma. (2) MIBG for localization in secondary hyperparathyroidism. (3) Calcaneal quantitative ultrasound (T-score threshold −2.5) for osteoporosis. (4) Anti-mullerian hormone for polycystic ovarian syndrome. (5) FDG PET for malignant adrenal mass. (6) Late night salivary cortisol for Cushing’s syndrome. (7) Urinary free cortisol for Cushing’s syndrome. (8) 2 mg dexamethasone suppression test for Cushing’s syndrome. (9) Urinary free cortisol plus dexamethasone suppression test for Cushing’s syndrome. (10) Urinary free cortisol plus salivary midnight cortisol for Cushing’s syndrome. (11) Urinary free cortisol plus salivary midnight cortisol plus dexamethasone suppression test (all tests) for Cushing’s syndrome. (12) IGF-1 for growth hormone deficiency. (13) IGFBP-3 for growth hormone deficiency. (14) Growth hormone releasing peptide 6 for growth hormone deficiency. (15) Serum growth hormone levels for growth hormone deficiency. (16) Glucagon stimulation test for growth hormone deficiency. (17) ITT for growth hormone deficiency. (18) 250 μg cosyntropin stimulation test for primary adrenal insufficiency. (19) 250 μg cosyntropin stimulation test for secondary adrenal insufficiency. (20) 123I-MIBG for localization of pheochromocytoma. (21) Endoscopic ultrasound for pancreatic neuroendocrine tumors. (22) Chromogranin A for neuroendocrine tumors. (23) Gallium-68 somatostatin receptor PET and PET/CT for thoracic and gastroenteropancreatic neuroendocrine tumors. (24) 18F DOPA PET or PET/CT for thoracic and gastroenteropancreatic neuroendocrine tumor. (25) 18F DOPA PET or PET/CT for pheochromocytoma/paraganglioma. (26) Telemedicine for diabetic retinopathy (absence). (27) Telemedicine for diabetic macular edema. 
(28) Hemoglobin A1C for postpartum abnormal glucose tolerance. (29) HbA1c+ 2 SD for diabetes. (30) Albumin urine concentration for microalbuminuria. (31) Ratio of albumin to creatinine for microalbuminuria. (32) Plaster for diabetic neuropathy. (33) Hemoglobin A1C for gestational diabetes. (34) QTc prolongation for diabetic autonomic failure. (35) MICRAL dipstick for diabetic microalbuminuria. (36) Probe to bone test for osteomyelitis in patients with diabetes. (37) Plain radiography for osteomyelitis in patients with diabetes. (38) MRI for osteomyelitis in patients with diabetes. (39) Bone scan Tc for osteomyelitis in patients with diabetes. (40) Bone scan for osteomyelitis in patients with diabetes. (41) Optical coherence tomography for diabetic macular edema. (42) 18F-dihydroxyphenylalanine positron emission tomography for congenital hyperinsulinism. (43) Pancreatic venous sampling for focal congenital hyperinsulinism. (44) 18F-DOPA PET for focal congenital hyperinsulinism. (45) Real-time elastography for thyroid cancer. (46) Core needle Biopsy for thyroid malignancy. (47) Bethesda system for reporting thyroid cytopathology (only malignant and benign) for thyroid cancer. (48) Frozen section for follicular lesions, thyroid cancer. (49) Frozen section for non-follicular lesions (thyroid cancer). (50) Frozen section for thyroid lesions, not otherwise specified (thyroid cancer). (51) Contrast-enhanced ultrasound for thyroid cancer. (52) MicroRNA for thyroid cancer. (53) Thyroid ultrasound features (taller than wide) for thyroid cancer. (54) Thyroid US features spongiform for thyroid cancer. (55) Standardized uptake value (SUVmax) for thyroid cancer. (56) Thyroglobulin washout for thyroid cancer. (57) Diffusion weighted MR imaging for thyroid cancer. (58) Serum thyroglobulin with immunometric assay (with ablation, during thyroxine) for thyroid cancer recurrence. (59) B-RAF V600E for thyroid cancer. (60) 99mTC methoxyisobutylisonitrile scintigraphy for thyroid cancer. 
(61) FDG PET for thyroid cancer. (62) GAL-3 for thyroid cancer. (63) HBME-1 for thyroid cancer. (64) 3G TRAB assay for Graves’ disease. (65) Ultrasonography for cervical lymph node metastases in papillary thyroid cancer

Twenty-six percent of the tests had a negative likelihood ratio of ≤0.10, only five of which reported an overall statement of risk of bias. Online Resource 3 shows a summary of diagnostic tests with −LR ≤ 0.10, considered to be most useful to “rule-out” a diagnosis in endocrinology. The most useful test, whether positive or negative, was the composite of urinary free cortisol, salivary midnight cortisol, and dexamethasone suppression test for ruling in or out Cushing’s syndrome.
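The thresholds used here (+LR ≥ 10 for ruling in, −LR ≤ 0.10 for ruling out) can be applied programmatically, as in the following sketch. The function name and the example tests with their likelihood ratios are hypothetical and do not reproduce the values in Figs. 3 and 4.

```python
from typing import List, Optional

RULE_IN_THRESHOLD = 10.0   # +LR at or above this is considered helpful when positive
RULE_OUT_THRESHOLD = 0.10  # -LR at or below this is considered helpful when negative


def classify_usefulness(lr_pos: Optional[float], lr_neg: Optional[float]) -> List[str]:
    """Label a test as useful to rule in, rule out, both, or neither."""
    labels = []
    if lr_pos is not None and lr_pos >= RULE_IN_THRESHOLD:
        labels.append("rule-in")
    if lr_neg is not None and lr_neg <= RULE_OUT_THRESHOLD:
        labels.append("rule-out")
    return labels


# Hypothetical tests: name -> (+LR, -LR).
tests = {
    "test A": (25.0, 0.05),
    "test B": (12.0, 0.40),
    "test C": (3.0, 0.30),
}
summary = {name: classify_usefulness(pos, neg) for name, (pos, neg) in tests.items()}
```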

The most commonly reported summary statistics were sensitivity and specificity (94 and 87%, respectively) (Online Resource 4). The most common way of representing findings was the use of forest plots (68%) followed by receiver operating characteristic curves (65%). Only one meta-analysis presented the results using a Fagan nomogram, and three used other graphical representations (Online Resource 5).

About half of the reviews clearly reported including only patients in whom there was diagnostic uncertainty, and one in three reported the prevalence of the condition in the study population. Only five studies assessed the overall confidence merited by the body of evidence, with unclear methods and limited conclusions provided (Fig. 5, Online Resource 6).

Fig. 5

Is the evidence synthesis credible and useful? Asterisk only applies to meta-analyses

Most of the reviews reported a clear gold standard (93%), a comprehensive search strategy, a well-defined clinical question, and clear inclusion criteria; reviewed primary studies for references; and provided a summary table of the included studies. Only about 31% reported that their search was unrestricted by language. Approximately one in five of the reviews reported contacting authors, with a mean author response rate of 57% (range 16–91%) (Fig. 5, Online Resource 6).

The majority of the reviews reported assessing the risk of bias of the included studies (69%), but only one-third provided readers with an overall statement regarding the risk of bias of the included studies. When available, this information was reported in the abstract of the study in 7% of the cases and in the conclusion of the manuscript in 22%. Assessment of publication bias was reported by 40% of authors; the most common method was use of funnel plots. When a conclusion about publication bias was made, most reviews (87%) found no significant publication bias.

The statistical method used for the diagnostic meta-analysis was not reported or was unclear in 8 (10%) of the reviews. Adequate methods of statistical analysis were reported in 16 (20%) of the meta-analyses. Other methods were reported in 70% of the reviews; the most common description was “random effects model”. Pre-specified subgroup analyses were reported in 34% of the studies. The majority of the reviews assessed for heterogeneity (77%). Most commonly, Cochran’s Q or the I² statistic was used, and 86% of those with a conclusion reported significant or high heterogeneity. Exploration of heterogeneity was performed in 38% of meta-analyses, mostly through subgroup analyses or meta-regression. Results changed in about half of the reviews that reported the results of exploring heterogeneity (14/30).
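For readers unfamiliar with the heterogeneity statistics named above, the following sketch computes Cochran's Q and the I² statistic from study-level effect estimates and their variances, using inverse-variance weights. It is a generic illustration: the function name and the example values are hypothetical, and the included reviews may have computed these statistics on different scales or with different software.

```python
from typing import Sequence, Tuple


def cochran_q_i2(
    effects: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Return (Q, I2 in percent) for study effects with known variances."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    # Inverse-variance weights.
    weights = [1.0 / v for v in variances]
    # Fixed-effect pooled estimate.
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I2: share of total variation beyond what chance would explain, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2


# Hypothetical effects (e.g., log diagnostic odds ratios) from three studies.
q, i2 = cochran_q_i2([0.0, 1.0, 2.0], [0.25, 0.25, 0.25])
```

In this hypothetical example Q is 8 on 2 degrees of freedom, giving an I² of 75%.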

Discussion

We performed a systematic evaluation of reviews and meta-analyses evaluating diagnostic tests in endocrinology to determine the clinical usefulness of the included tests and the credibility of the syntheses. Most of the included reviews focused on glycemia/diabetes and thyroid conditions. Clinically, ~50% of the diagnostic tests had a positive likelihood ratio equal to or greater than 10, significantly changing the probability of disease and helping clinicians rule in a disease if the test is positive. About 25% of the tests had a negative likelihood ratio ≤0.10, and can significantly decrease the probability of disease and help clinicians rule out a disease process. These diagnostic tests (Online Resources 2 and 3) can therefore be considered among the most useful tests in day-to-day endocrine clinical practice. This conclusion, however, is limited by the quality of information regarding the risk of bias of the included studies.

Likelihood ratios were used as summary statistics in 43% of the reviews and applied in a Fagan nomogram in only one. As in previous evaluations, the most common measures used to report diagnostic accuracy were sensitivity and specificity [96]. Studies using hypothetical scenarios have shown that physicians integrate diagnostic performance information into their clinical assessment more accurately when it is provided as a likelihood ratio described in non-technical wording [8] and that probability modifying plots and natural frequency trees might help clinicians more accurately interpret diagnostic test results [97].
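The arithmetic that a Fagan nomogram performs graphically, converting a pre-test probability and a likelihood ratio into a post-test probability, can be sketched as follows. The function name and the 20% pre-test probability are hypothetical and chosen only for illustration.

```python
def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Apply a likelihood ratio to a pre-test probability via odds."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre-test probability must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood ratio cannot be negative")
    # Convert probability to odds, multiply by the LR, convert back.
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


# Hypothetical patient with a 20% pre-test probability of disease.
after_positive = post_test_probability(0.20, 10.0)  # positive test, +LR = 10
after_negative = post_test_probability(0.20, 0.10)  # negative test, -LR = 0.10
```

Under these assumptions, a positive result with a +LR of 10 raises the probability of disease from 20% to about 71%, and a negative result with a −LR of 0.10 lowers it to about 2%.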

Empirical evidence suggests that characteristics of the sampled population can bias the results of evaluations of diagnostic test performance [98, 99]. About half of systematic reviewers reported only including primary studies with patients with diagnostic uncertainty and one-third reported the prevalence of the condition in the study population. These findings limit clinicians’ ability to assess the applicability of findings of SRs in their clinical settings.

Although most reviews reported credible methods for study identification and selection, included a clear gold standard, and evaluated the risk of bias of the studies, a limited number used appropriate statistical methods, contacted authors, or provided an overall summary of the risk of bias of the included studies. In 2006, a SR that evaluated 89 reviews of diagnostic tests in oncology found that reviews commonly defined inclusion criteria, provided a summary of the included studies, and reported performing crucial steps in duplicate. Quality assessment of the included studies was performed in 61% of the studies and a formal assessment provided in only 30% [6]. Our results, in a different field of medicine and 10 years later, are quite similar.

The number of reviews of diagnostic tests in endocrinology that reported author contact (~20%) to obtain primary information was lower than in SRs in general (50–85%) [100], though this procedure can lead to changes in the estimates of clinical utility of tests [101]. In a study of 114 SRs, half of the authors conducted an explicit assessment of the quality of the individual studies that were included and in 64 of the cases the results of these evaluations were presented in a table [9]. We found that although most reviews evaluated the quality of the included studies, most failed to provide an overall statement for the reader and a minority (7%) reported this information in the abstract. In addition, the majority of the studies included some description of statistical methods used, but only 20% used the bivariate or hierarchical models recommended by experts for diagnostic meta-analyses. These methods are preferred because they simultaneously summarize sensitivity and specificity [10, 102–104].

SRs and meta-analyses should include assessments of the body of evidence based on the study design and risk of bias of the included studies, indirectness, inconsistency across results, imprecision, and risk of publication bias [105]. We found that most diagnostic SRs in endocrinology fail to provide an overall statement about study quality, do not commonly evaluate the risk of publication bias, explore inconsistencies, or provide an overall assessment of the confidence in the evidence. This limits the results of these reviews to a summary estimate without assessing the confidence merited by the results or providing actionable insight to improve the quality of the body of evidence (identification of knowledge gaps and areas for research).

Diagnostic SRs are critical in clinical practice to: (1) identify the most useful clinical tests and (2) help identify areas where further research is needed. These benefits can be hampered if the methodology used to perform and report these reviews is not adequate. Our review suggests multiple areas where authors of SRs of diagnostic tests could improve such as: author contact, exploration of heterogeneity (e.g., sensitivity analysis), use of advanced statistical methods, reporting of the overall risk of bias of included studies, and assessing the confidence in the body of the evidence.

We performed subgroup analyses to identify those studies that would be more useful in clinical practice; however, this was limited to the included tests (useful tests for which a SR has not been done are missing) and those that provided information that allowed us to calculate a likelihood ratio. Diagnostic accuracy estimates and risk of bias assessment are just a few of the components that can be used to determine if a test would actually benefit a patient. Diagnostic test studies are not usually linked to a patient-important outcome; they provide only indirect evidence of potential benefits that depend on other factors such as availability of treatment for a potential disease and patient context [105].

This SR audit summarizes the accuracy of diagnostic tests in endocrinology and the reporting of these reviews, and helps to identify knowledge gaps in diagnosis. Several limitations exist to the application of our results in clinical practice. First, our search may have missed SRs not indexed under endocrinology. Second, we did not assess the quality of studies included in SRs and relied (when available) on authors’ assessments of quality for our analysis. Importantly, while best methods to perform and report a SR and meta-analysis of healthcare interventions are delineated by PRISMA [106], best standards for SRs of diagnostic tests may differ [107].

Conclusions

Almost half of the tests in which a LR could be calculated or was provided produced significant variations in the pre-test probability of disease and are very likely to be helpful when positive; ~25% of the tests are very likely to be helpful when negative. In general, most reviews of diagnostic tests in endocrinology followed acceptable general methods for study identification, screening, and extraction. Most of the reviews reported evaluating the risk of bias (rarely providing an overall statement), but only 23% contacted authors and 20% used adequate statistical methods. As a result, the overall confidence in the diagnostic estimates provided by these studies is limited.

Progress in the field of diagnostic tests in endocrinology should be supported by standardized methods and reporting for SRs and meta-analyses. These standards should include use of adequate statistical methods and overall statements about the confidence in the body of evidence evaluating these tests. These evidence summaries should help clinicians and patients discuss the usefulness of the tests and the trustworthiness of the evidence producing these estimates.