Introduction

Thyroid nodular disease is a frequent condition, more common in women than in men, and increases with age [1]. The majority of thyroid nodules are benign, with malignancy rates of less than 5–10% [2, 3]. Differentiated thyroid carcinomas (DTC) encompass papillary thyroid carcinoma (PTC) and follicular thyroid carcinoma (FTC), and account for > 90% of primary thyroid carcinoma [4]. DTC has a favorable prognosis and a high cure rate following treatment [4, 5]. However, recurrence is seen in 20–30% of patients followed for 10–20 years [6,7,8,9].

The main goal of the diagnostic work-up of thyroid nodules is to rule out malignancy. A multi-disciplinary risk-stratification system is applied to assess functional and morphological characteristics of the thyroid as well as the risk of malignancy [3, 10, 11]. The cornerstones of thyroid nodule diagnostics are ultrasonography (US) and fine needle aspiration biopsy (FNAB). Patient selection for these methods follows an assessment of clinical risk factors for thyroid malignancy, including objective findings, thyroid function tests, and the result of 99Tc-scintigraphy [3, 11]. The indication for FNAB is based on a risk assessment of certain US features suggestive of malignancy, while the results of FNAB triage the patients into a management strategy according to the Bethesda System for Reporting Thyroid Cytopathology (BSRTC) [10, 12, 13]. However, in a significant subset of patients, preoperative tests cannot distinguish malignant from benign nodules, and in these cases, thyroid surgery is recommended for histopathological verification of the diagnosis [3, 12]. A diagnostic challenge applies in particular to patients with FNAB categorized into the heterogeneous group of indeterminate results (BSRTC 3–5) and to those with persistently non-diagnostic (ND, BSRTC 1) samples [12, 14].

The increased detection of thyroid nodules by imaging, combined with the low rate of thyroid nodules harboring malignancy [3], calls for improved preoperative risk-stratification tools to identify patients with thyroid carcinoma. This would potentially allow more individualized management, eventually leading to fewer diagnostic thyroid operations.

Elastography

Palpation of tumors to assess tissue stiffness is a fundamental and ancient clinical examination. Generally, malignant tumors are believed to be stiffer than benign ones, but there are exceptions (e.g., fibrosis or cystic areas). Furthermore, findings by palpation are highly investigator dependent [15]. Ultrasound elastography measures tissue elasticity (stiffness) in a more objective manner, based on an automatic detection of tissue movements induced by externally applied forces [16]. Elastographic methods are categorized according to the external force employed, the measured quantity, and the manner in which results are displayed [17]. The multiple ways of categorizing the various technologies have led to terminological inconsistency, and various acronyms may apply to similar methods provided by different manufacturers.

The quasi-static method, termed strain elastography (SE), relies on manually applied pressure on the transducer by the investigator or from carotid artery pulsation to cause tissue deformation [18]. From the registration of tissue deformation (strain), being inversely related to tissue stiffness, a qualitative elasticity map or semi-quantitative ratio of elasticity measurement is displayed [16, 18]. SE was the first elastographic method available for thyroid evaluation. However, strain of a thyroid nodule is influenced by adjacent thyroid tissue and moreover, SE has limitations in the evaluation of multinodular goiter, deeply located nodules, and in nodules containing calcifications [16, 19].

The dynamic methods, covering acoustic radiation force impulse (ARFI) imaging, 2D shear wave elastography (SWE), and point SWE (pSWE), apply standardized acoustic impulses from the US transducer to induce minute tissue movements resulting in transverse shear waves. The shear wave speed is then registered and translated into a quantitative measurement of elasticity [16, 18, 19]. These technologies are believed to be less user dependent and, when introduced, were considered more technologically advanced than SE [18]. The technologies differ with regard to the size of the measured area, the method for tissue displacement, and the way the elasticity signal is displayed alongside the quantitative measurements [16].

US elastography constitutes a natural extension of gray-scale US in the evaluation of thyroid nodules. Several studies, employing different methodologies, have shown promising results of elastography in the evaluation of thyroid nodules, with reports of lower elasticity (i.e., higher stiffness) in malignant than in benign thyroid nodules [20,21,22].

Shear wave elastography

Real-time 2D SWE (hereafter termed SWE) measures real-time tissue elasticity, quantified as an elasticity index (EI) expressed in kilopascal (kPa) or as shear wave speed (m/s), along with a qualitative color-coded elasticity map [16, 23, 24]. The technology exploits the registration of shear waves generated by tiny tissue movements resulting from acoustic impulses emitted from the US transducer along a pushing line [16]. The speed of the shear waves along several simultaneous pushing lines is registered by the US apparatus and is closely related to tissue elasticity, expressed as Young’s modulus [17]. In this way, a live color-coded elasticity map of 2 × 3 cm is displayed overlaying the gray-scale US image [16], with corresponding measurements of elasticity expressed in kPa (Fig. 1). After freezing the elasticity map, the investigator places a movable and size-adjustable region of interest (ROI), whereby quantitative EI measurements are displayed (Fig. 1).
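As a rough illustration of how a shear wave speed relates to a stiffness value in kPa, the sketch below uses the commonly cited approximation E = 3ρc². It assumes incompressible, isotropic, linearly elastic tissue with a density close to that of water (1000 kg/m³). These are simplifying assumptions, not a description of any manufacturer's implementation, and the function names are hypothetical.

```python
import math

# Assumed soft-tissue density (kg/m^3); an illustrative value, close to water.
TISSUE_DENSITY_KG_M3 = 1000.0


def speed_to_kpa(speed_m_s, density=TISSUE_DENSITY_KG_M3):
    """Young's modulus in kPa from shear wave speed, using E = 3 * rho * c^2."""
    if speed_m_s < 0:
        raise ValueError("shear wave speed must be non-negative")
    return 3.0 * density * speed_m_s ** 2 / 1000.0  # Pa -> kPa


def kpa_to_speed(modulus_kpa, density=TISSUE_DENSITY_KG_M3):
    """Inverse conversion: shear wave speed in m/s from Young's modulus in kPa."""
    if modulus_kpa < 0:
        raise ValueError("modulus must be non-negative")
    return math.sqrt(modulus_kpa * 1000.0 / (3.0 * density))


print(speed_to_kpa(3.0))   # 27.0 (kPa)
print(kpa_to_speed(27.0))  # 3.0 (m/s)
```

Because the modulus scales with the square of the speed, the same pair of measurements differs more in kPa than in m/s, which is worth keeping in mind when comparing cut-off points reported in different units.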

Fig. 1

SWE image depicting the SWE acquisition- and ROI placement process. a Frozen SWE image; b the same SWE image with ROIs and, to the right, EI measurements

SWE is believed to be less user dependent than qualitative or semi-quantitative quasi-static methods, as SWE uses acoustic impulses generated by the transducer to measure elasticity, rather than external pressure applied by the investigator [16, 19]. Further, the technology allows for assessment of the distribution and heterogeneity of elasticity within a large area, both qualitatively and quantitatively, with the possibility of avoiding artifacts during ROI placement. Also, elasticity is quantified by EI, allowing for comparison between adjacent areas (EI ratio) or registration of changes over time. Finally, split-screen mode allows simultaneous assessment of morphology and elasticity, ensuring that EI is measured within the index nodule [16].

A higher EI in thyroid carcinoma compared with benign thyroid nodules has been reported [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39], and SWE has been proposed as an adjuvant to conventional gray-scale US [24, 29, 30]. However, EI values of malignant and benign nodules overlap, and cut-off points for the differentiation between thyroid carcinoma and benign nodules diverge, ranging from 22 to 94 kPa [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39].

Thus, the clinical application of SWE is unclear, despite the fact that the first report was published ten years ago [25]. Therefore, we reviewed studies of SWE, with special focus on the clinical applicability, and balancing the advantages and the shortcomings of the method.

Materials and methods

A comprehensive literature search (24th February 2020) was conducted in PubMed using the terms “thyroid” AND “shear wave elastography” or “thyroid” AND “elastography”. Reference lists of screened and included papers were reviewed for additional studies. First, titles were screened, then abstracts, and finally full texts were read by one author (KZS). The flowchart of literature selection is depicted in Fig. 2. The inclusion criteria were: (1) SWE performed in diagnostic studies of thyroid nodules with regard to differentiating malignant from benign nodules, or studies of thyroid nodular SWE reproducibility, (2) quantitative assessment reporting EI measurements, (3) adult patients (≥ 18 years), and (4) English language. The included studies are listed in Tables 1 and 2. No meta-analysis assessing diagnostic performance was performed due to large heterogeneity across studies with regard to elasticity measurements and cut-off points.

Fig. 2

Flow-chart of study selection. a: 5 studies assess both diagnostic properties and reproducibility of SWE

Table 1 Studies evaluating the diagnostic accuracy of SWE of thyroid nodules
Table 2 Studies evaluating the SWE reproducibility of thyroid nodules

Results

Diagnostic properties of thyroid SWE

Many authors reported higher EI in malignant compared with benign nodules [25,26,27,28,29,30,31,32,33,34,35, 37, 40,41,42,43,44,45,46]. During the past few years, less encouraging results have emerged, as two studies reported almost similar EI values for benign and malignant thyroid nodules [47, 48].

Original studies investigating the diagnostic performance of thyroid SWE are summarized in Table 1. Significant heterogeneity exists regarding the optimum parameter for elasticity assessment, definition of ROI for EI measurements, EI cut-off points, elasticity scale settings, and the scanning planes used [18, 49]. Elasticity assessments around the stiffest area of the nodule are the most common EI outcomes reported, reflected by measures of mean and maximum elasticity.

The majority of the proposed EI cut-off points (Table 1) reflect a suboptimal diagnostic accuracy, as determined by the area under the curve (AUC) in ROC analyses. AUC ranged from 0.61 to 0.94, and 89% of the studies showed an AUC within the range 0.70–0.90. Specificity and sensitivity were 0.48–0.97 and 0.42–0.95, respectively (Table 1). The first report on this particular method found an exceptionally high AUC of 0.94 [25], but subsequent studies failed to reproduce such an encouraging result. Thus, although several studies provide fairly operational EI cut-off levels on a group basis, the diagnostic value in the individual patient is suboptimal, explained by the large overlap in EI between benign and malignant nodules (Table 1).
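To make the relation between a cut-off point, sensitivity, specificity, and AUC concrete, the following minimal sketch computes these quantities from a list of EI values. The EI values and the 40 kPa cut-off are invented for illustration only and do not come from any of the reviewed studies. The function names are hypothetical, and a nodule is assumed to be test-positive when its EI is at or above the cut-off.

```python
def sensitivity_specificity(ei_values, is_malignant, cutoff_kpa):
    """Sensitivity and specificity when EI >= cutoff is read as 'malignant'."""
    tp = fn = tn = fp = 0
    for ei, malignant in zip(ei_values, is_malignant):
        positive = ei >= cutoff_kpa
        if malignant and positive:
            tp += 1
        elif malignant:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both malignant and benign nodules are required")
    return tp / (tp + fn), tn / (tn + fp)


def auc(ei_values, is_malignant):
    """AUC as the probability that a malignant nodule has a higher EI
    than a benign one (ties count as one half)."""
    malignant = [ei for ei, m in zip(ei_values, is_malignant) if m]
    benign = [ei for ei, m in zip(ei_values, is_malignant) if not m]
    if not malignant or not benign:
        raise ValueError("both malignant and benign nodules are required")
    wins = sum(1.0 if m > b else 0.5 if m == b else 0.0
               for m in malignant for b in benign)
    return wins / (len(malignant) * len(benign))


# Invented example data (kPa): four benign nodules followed by four malignant ones.
ei = [20, 35, 50, 80, 30, 45, 60, 90]
labels = [False, False, False, False, True, True, True, True]

print(sensitivity_specificity(ei, labels, cutoff_kpa=40))  # (0.75, 0.5)
print(auc(ei, labels))                                     # 0.625
```

In this toy data set the malignant nodules are stiffer on average, yet the overlapping values keep both the specificity and the AUC modest. This mirrors the distinction drawn above between group-level differences and classification of the individual nodule.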

Several meta-analyses of the diagnostic accuracy of thyroid SWE have been performed, with diverging results [20, 21, 49,50,51,52,53,54]. Although the pooled sensitivity and specificity found in some studies seem encouraging, the clinical utility of these analyses is highly questionable, as various technologies (SWE, pSWE, ARFI) were pooled, patient cohorts were highly heterogeneous, and a range of different cut-off points were used for determining the outcomes. One meta-analysis, including SWE studies only, found suboptimal performance of the method, reflected by a sensitivity and a specificity of 0.66 and 0.78, respectively [51].

Two recent studies evaluated a novel 3D SWE method [38, 41]. 3D SWE uses a technology similar to that of 2D SWE, but by inclusion of volume measurements, a three-dimensional EI map is generated with several image slices. This technology might seem promising, but the diagnostic accuracy turned out to be similar to or even lower than that achieved by 2D SWE [38, 41]. Thus, this extended version of SWE does not seem to overcome the current limitations of the technology.

Efforts have been made to establish a relevant subgroup of patients, in whom SWE would improve the accuracy of thyroid nodular risk-stratification [26, 31, 36, 37, 47, 48]. However, such a subgroup remains to be identified, and the risk of misclassifying thyroid nodules is currently unacceptably high if based only on SWE results.

SWE has been suggested as an add-on investigation to gray-scale US to select patients for FNAB or surgery, rather than a separate diagnostic tool replacing conventional US or even FNAB [22, 24, 29, 30]. The diagnostic accuracy of SWE as an adjuvant examination has been addressed in several studies, but results have been conflicting [25,26,27, 30, 41, 44, 45, 55]. Although sensitivity may increase when combining US and SWE, as compared with SWE or US alone, a decline in specificity is seen [26, 27, 29, 30, 41]. Some studies reported no change in the diagnostic performance when combining SWE and US compared with SWE alone, whereas other reports were more positive [25, 31, 44, 45]. In a recent study [56], a combination of qualitative assessment of SWE images and a preoperative BRAF gene detection was superior to the individual performance of both tools in terms of identifying cancer. The combined method showed a sensitivity of 93% and a specificity of 95% [56]. Such diagnostic performance seems high but needs to be confirmed in prospective studies.

For the identification of malignant nodules embedded in a multinodular goiter, SWE has shown better performance than indices of nodule size or suspicious US features [57]. SWE may also be employed for prognostication of PTC, as a positive association seems to exist between EI and factors like extra-thyroidal extension, multifocality, and central lymph node metastasis [58].

Factors influencing diagnostic accuracy

The EI is influenced by the histological properties of the nodule. PTC, the most common type of thyroid carcinoma, is characterized by higher EI as compared with benign nodules as well as other thyroid malignancies [36, 47, 59]. Accordingly, studies including a high percentage of PTC reported the highest EI cut-off points [27, 30], while studies with few cases of PTC reported either a low EI cut-off point [37] or no difference between malignant and benign nodules [47, 48] (Table 1). These findings may partly be explained by the presence of fibrosis, found especially in PTC but also in benign tissue such as chronic autoimmune thyroiditis [60]. Thus, two studies employing SWE and ARFI, respectively [59, 61], found a positive correlation between nodular stiffness and the degree of fibrosis in histological thyroid specimens. These findings indicate that fibrosis may explain, at least partly, the overlap in EI seen between malignant and benign lesions.

Indeterminate cytology is more frequently associated with FTC, thus introducing preselection of these less stiff cancers in studies only investigating indeterminate nodules [37, 47]. On the contrary, studies excluding patients with indeterminate cytology without histological confirmation found higher stiffness of the malignant lesions due to a higher percentage of PTC [29,30,31, 62, 63].

Nodule size is another factor that should be taken into account. Different cut-off points may be applicable, with higher EI in larger nodules [29, 31, 34, 48]. Interestingly, one study found higher diagnostic accuracy of SWE in nodules less than 1 cm compared with larger nodules [55]. For the diagnosis of micro-PTC, relatively low cut-off points (34.5 kPa) were reported in two studies [31, 46].

Comparison of different technologies

One study, evaluating 84 thyroid nodules, compared the diagnostic accuracy of SE, ARFI, and SWE head-to-head [43]. AUC of ARFI was similar to that of SWE but was higher compared with SE. In contrast, a more recent study found superior performance of SE in comparison with SWE and Thyroid Imaging Reporting and Data System (TIRADS) [55]. Meta-analyses compared the diagnostic properties of SE against the pooled results of different SWE technologies (SWE, pSWE, ARFI). High diagnostic accuracy of both methodologies was found, with summary AUC in the range of 0.83–0.94, and false-positive and -negative rates of 14.6–16.0% and 3.1–5.0%, respectively [20, 49, 54]. Performance was slightly better for SE than for the SWE technologies. However, pooling results from different cohorts, SWE technologies and technical settings, as well as different cut-off points, is not particularly meaningful. Thus, the encouraging results found in previous meta-analyses [20, 21, 49, 50, 52,53,54] are not clinically viable when the pooled performance of different SWE technologies is assessed. When restricting a meta-analysis to studies only using SWE, the result is even less impressive [51].

SWE reproducibility

Agreement between repeated measurements is an important factor, influencing the performance of any diagnostic test. Although SWE is considered to be user independent [16, 19], several investigator-dependent steps may affect SWE acquisition and EI measurement by ROI placement. Therefore, the entire process of SWE acquisition, elasticity interpretation, and EI measurement needs to be taken into account when assessing SWE reproducibility. Several factors are difficult to standardize, e.g., pre-compression, the plane selected for EI measurement, the timing when freezing the live color-coded film sequence, and the interpretation of artifacts. In contrast, a standardized selection and placement of ROI is easier to accomplish through predefined criteria and specific definitions of ROI [62]. Thus, SWE agreement is influenced by a number of factors that contribute to the overall variability of the method.

The reproducibility of thyroid SWE has been investigated in several studies [26, 34, 38, 40, 47, 59, 62, 64,65,66], as listed in Table 2. Data were retrieved by retrospective EI measurements from stored film sequences or images, or by assessment of the whole SWE process during both acquisition and EI measurements (ROI placement) (Table 2) [26, 38, 62, 64, 65].

Diverging results have been reported when evaluating the whole SWE acquisition process. One study found suboptimal agreement, with inter-, intra-, and day-to-day limits of agreement (LOA) ratio in the range of 1.7–3.7 and proportion of agreement in the range 0.63–0.88 [66]. Other studies reported inter-rater intra-class correlation coefficients (ICCs) of 0.34–0.85 and intra-rater ICCs of 0.59–0.85, depending on EI outcomes and heterogeneity of the elasticity map [34, 40, 47, 62]. For further details, see Table 2.

Factors influencing reproducibility

Lower agreement seems to exist when evaluating malignant compared with benign nodules, possibly due to a higher degree of elastic heterogeneity in malignant nodules, especially PTC (Fig. 3) [59, 62, 66].

Fig. 3

SWE agreement in elastic homogeneous and heterogeneous nodules. a1-a3 three consecutive SWE examinations in the same homogeneous nodule. b1-b3 three consecutive SWE examinations in the same heterogeneous nodule

When only assessing ROI placement rather than the entire SWE acquisition process, studies of the thyroid gland and mammary tissue showed higher agreement, indicating that the largest variation lies within the multiple steps of SWE acquisition [62, 64, 65, 67,68,69]. One study reported significantly higher agreement by assessing the whole nodule EI, as compared with a 3 mm ROI (defined by the operator) around the stiffest area of the nodule [59]. This observation supports that the SWE acquisition process should be as simple as possible, limiting the number of investigator-dependent steps. Two studies found no influence of investigator experience on the prevalence of artifacts, or on the inter-observer agreement [66, 70].

Estimation of agreement is influenced by the statistical methods applied and the heterogeneity of data, making a comparison across studies difficult [71, 72]. A statistical test using limits of agreement (LOA) is considered the most suitable for a dataset with high heterogeneity. Indeed, heterogeneity of data obtained from thyroid elastography is caused by the inherent feature of the thyroid tissue, as well as variations within and between observers, and from day-to-day [61, 71,72,73].
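One way to read an LOA ratio is sketched below, following the general Bland-Altman approach applied to log-transformed paired measurements. This is an assumption about how such a ratio can be derived, not necessarily the exact procedure used in the cited studies. The function name is hypothetical, and the 1.96 multiplier assumes approximately normally distributed log differences.

```python
import math
import statistics


def ratio_limits_of_agreement(first, second):
    """Bland-Altman limits of agreement on the ratio scale.

    Takes paired, strictly positive measurements (e.g., EI in kPa from two
    observers) and returns (mean ratio, lower limit, upper limit).
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need at least two paired measurements")
    log_diffs = [math.log(a / b) for a, b in zip(first, second)]
    mean_diff = statistics.mean(log_diffs)
    sd_diff = statistics.stdev(log_diffs)  # sample standard deviation
    lower = math.exp(mean_diff - 1.96 * sd_diff)
    upper = math.exp(mean_diff + 1.96 * sd_diff)
    return math.exp(mean_diff), lower, upper


# Invented paired EI values (kPa) from two hypothetical observers.
observer_1 = [30.0, 40.0, 50.0, 22.0]
observer_2 = [25.0, 50.0, 45.0, 30.0]
bias, lower, upper = ratio_limits_of_agreement(observer_1, observer_2)
print(f"mean ratio {bias:.2f}, limits {lower:.2f} to {upper:.2f}")
```

Under these assumptions, an upper limit of 2, for example, would mean that one measurement can plausibly be twice as high as its repeat. This is why limits on the ratio scale are easier to relate to a fixed kPa cut-off than a correlation coefficient.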

Considerations concerning reproducibility

SWE reproducibility should be put into further perspective, as it may be affected by the spatial heterogeneity of thyroid morphology [61, 73] as well as the dynamic properties of US [74, 75]. Conventional gray-scale US harbors an inherent inter-observer variability, which most likely influences the acquisition process of elastography. Several studies, investigating observer agreement of single US features as well as US risk-stratification systems, reported diverging results, ranging from poor to substantial agreement (kappa: 0.11–0.91) [76,77,78,79,80,81]. Identification of thyroid calcifications has shown high inter-rater agreement (0.67–0.91) [76,77,78], but depends on the experience of the operator. In contrast, US features that are difficult to interpret include nodule borders, the significance of a solid component within a complex nodule, signs of extra-thyroidal growth, and intra-nodular flow assessed by Doppler [10, 76,77,78]. Inter-observer agreement is generally reported to be higher for US risk-stratification systems (e.g., TIRADS) (0.25–0.76) than for single US features [78,79,80,81]. The clinical decision-making of when to perform FNAB according to the various US systems showed even higher agreement between investigators [78, 81].

Conflicting results were also reported for reliability assessment of other elastographic technologies applied to the thyroid. For SE, Park et al. [82] found no statistically significant inter-observer correlation for real-time acquisition and the interpretation of elastographic findings, while two other studies showed more promising results (inter-rater Cohen’s kappa: 0.64 [83]; 0.74 [63]). Similar levels of reliability apply to ARFI and pSWE [84,85,86]. In an early SE reliability study [82], assessment of the influence from pre-compression was not possible (pressure control). However, this was implemented in subsequent studies [83], which might explain, in part, the differences across studies in the reliability of this method [22]. Although SE uses manual compression as external force, this technology is not necessarily more operator-dependent than SWE [18].
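For readers less familiar with the kappa statistic quoted above, the following minimal sketch computes Cohen's kappa for two raters who each assign one category per nodule. The ratings are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Invented qualitative elasticity ratings of four nodules by two observers.
observer_1 = ["soft", "soft", "stiff", "stiff"]
observer_2 = ["soft", "stiff", "stiff", "stiff"]
print(cohens_kappa(observer_1, observer_2))  # 0.5
```

Here the observers agree on three of four nodules (75%), but half of that agreement would be expected by chance alone, so kappa is only 0.5. This is why raw percentage agreement and kappa values from different studies cannot be compared directly.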

Methodological issues of thyroid SWE

SWE artifacts

Artifacts may arise during SWE acquisition [70], and caution must be taken when interpreting the elasticity map and the EI measurements (Fig. 4). One study found artifacts in 70% of 1297 SWE images, with the most frequent type (35%) being caused by pre-compression [70]. In 19% of investigated nodules, SWE was uninterpretable due to artifacts. Pre-compression during SWE acquisition results from the pressure on the transducer applied by the investigator (Fig. 4a). The magnitude of pressure affects tissue elasticity, and EI increases with increasing pre-compression, potentially inducing unintended measurement variability [17, 87, 88]. Nodules located in the isthmus may be more prone to pre-compression due to the proximity to the trachea [18]. Currently, quantification of pre-compression is not possible in SWE; therefore, the use of generous amounts of gel to avoid artifacts at the cervical fascia is recommended.

Fig. 4

SWE artifacts. a Pre-compression artifact seen by the presence of increased stiffness at the skin surface (red color band at the cervical fascia); b Artifact of increased stiffness in tracheal lumen in proximity to the tracheal cartilage; c Increased stiffness within normal thyroid parenchyma medially to a thyroid nodule and anterior to trachea; d Poor SWE signal in a markedly hypoechoic nodule

Artifacts of increased stiffness also occur when structural interfaces are encountered, as disruption of the linear relationship between the speed of the shear waves and the elastic modulus is induced when crossing tissue borders (Fig. 4b). The split-screen mode is helpful in identifying the morphological borders on the gray-scale US, which may not be possible from the elasticity map [16, 17]. Vertical artifacts represent bands of increased stiffness crossing anatomical borders (Fig. 4c) [36]. Movement artifacts during SWE acquisition may arise from tiny unwanted movements of the transducer or the chest due to respiration, which result in persistent color changes in the image beyond the first 3–5 s. Movement artifacts from the carotids are reduced by placing the probe in the longitudinal plane during SWE acquisition [24]. The color codes should be stable before freezing the film sequence prior to EI measurements. However, this may not be possible in heterogeneous nodules, even when no movement is visible, which probably has an unfavorable impact on the EI reproducibility.

Micro- or macrocalcifications may induce increased stiffness [89], and elasticity measures should be avoided in areas with macrocalcifications. Avoiding microcalcifications may be more difficult, and their presence must be taken into account when interpreting EI measurement in nodules harboring such elements [89]. The gray-scale split screen allows the investigator to evaluate whether EI is influenced by calcifications or if other hyperechoic spots are present [10]. Shear waves do not propagate in fluid, making elasticity assessments impossible in cystic areas [16]. In fact, EI may be increased in nodules with cystic areas due to the low deformation potential of fluids [89]. Very hypoechoic or deeply located nodules may reflect a poor or even no SWE signal (Fig. 4d). In these nodules, SWE does not provide reliable information, even when adjusting the settings of the equipment [24, 36]. Poor SWE signal has been associated with malignancy [48], and may, thus, represent a surrogate marker of hypoechogenicity.

Thyroid heterogeneity

Elastography determines tissue elasticity indirectly by measurements of tissue response to applied external stress. The technology behind elastography relies on the assumption that the investigated tissue exhibits simple behavior, such as being linear and homogeneous [16, 90]. However, biological soft tissue exhibits properties of heterogeneity, non-linearity, and viscoelasticity, which makes it less suitable to fit into this simplified model [16, 73]. Although elastography may detect disorders of various organs [25, 26, 28,29,30, 36, 91,92,93], interpretation of elasticity data should be done with caution. Artifacts arising during SWE acquisition are largely explained by properties related to soft tissue. Pre-compression artifacts and the increased stiffness with increasing pre-compression load, reported by Lam et al. [87], are explained by the non-linear properties of soft thyroid tissue. Similarly, the heterogeneity of soft tissue, including boundaries between adjacent tissues with different properties, explains the phenomenon of structure interface artifacts [17]. Therefore, when assessing thyroid nodular stiffness, it is important to take into account the heterogeneous nature of thyroid nodules caused by cell density, calcifications, fibrosis, adipose tissue, and cystic areas [61]. These factors apply especially to PTC and may be presented as heterogeneous elasticity maps [73]. Considerable heterogeneity, both within and between subjects, has been reported when assessing the qualitative elasticity map [48, 66]. The standard deviation of EI has recently been proposed as a better measure of elastic heterogeneity than absolute EI values [39, 46, 48]. However, quantifying this heterogeneity is challenging, and identification of relevant quantitative markers of EI heterogeneity in the prediction of thyroid malignancy is yet to be accomplished and validated [39, 48, 66]. 
Texture analysis employing mathematical models has also been introduced as a novel method to assess elastic heterogeneity [73]. One feasibility study reported superior performance of texture analysis compared with conventional EI measurements [73]. Although the method seems promising, these findings need to be validated in future studies.
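The EI outcomes discussed in this review (mean, maximum, and standard deviation within an ROI) can be summarized as in the sketch below. It assumes that the per-pixel elasticity values inside the ROI are available as a plain list of kPa values, which is a simplification, since access to such data depends on the equipment. The values and the function name are hypothetical.

```python
import statistics


def summarize_roi(pixel_kpa):
    """Mean, maximum, and sample standard deviation of EI within one ROI."""
    if len(pixel_kpa) < 2:
        raise ValueError("need at least two elasticity values")
    return {
        "mean": statistics.mean(pixel_kpa),
        "max": max(pixel_kpa),
        "sd": statistics.stdev(pixel_kpa),
    }


# Invented elasticity values (kPa) for a homogeneous and a heterogeneous ROI.
homogeneous_roi = [20.0, 21.0, 19.0, 20.0]
heterogeneous_roi = [10.0, 60.0, 25.0, 85.0]

print(summarize_roi(homogeneous_roi))
print(summarize_roi(heterogeneous_roi))
```

In this invented example the heterogeneous ROI has a much larger standard deviation than the homogeneous one, which is the property the SD-based proposals aim to capture. Whether such a summary adds diagnostic value remains an open question, as noted above.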

Conclusions

The present SWE technology does not seem robust enough to be clinically implemented on a wide scale. SWE may be promising on a group level, but the risk of misclassifying a thyroid nodule seems unacceptably high in the individual patient due to the substantial overlap of EI values observed in benign and malignant lesions. A number of confounding factors affect elasticity measurements, such as the heterogeneous nature of thyroid nodules and the co-existence of fibrosis or autoimmune thyroiditis [60, 61].

In light of current evidence, there is a need for standardization and consensus on the optimal SWE acquisition process. Operator-dependent factors include the applied pre-compression level, the scanning plane, the timing when freezing the live SWE film sequence, and the optimal ROIs. One recent guideline [24] suggested such standardization, but it remains to be confirmed in future studies whether this could help identify a cut-off point that can reliably differentiate malignant from benign thyroid nodules on the individual level. Previous studies were highly heterogeneous with regard to the risk of malignancy, the process of SWE acquisition, and the EI endpoints evaluated, factors that preclude a reliable meta-analysis. The low observer agreement and the diverging results of repeated measurements add further concern to the clinical applicability of the current SWE method. This is probably explained by methodological limitations of the technology per se, in combination with the high degree of heterogeneity of thyroid nodular tissue [22]. Thus, it remains to be demonstrated whether clinically useful information can be achieved in patients with thyroid nodules when SWE is applied on top of other US characteristics or genetic testing.