Introduction

Osteoarthritis (OA) is the most common musculoskeletal joint disease that mainly affects the hips, knees, hands and spine. It leads to pain and impaired function, especially in the elderly (Harzy et al. 2009). Its prevalence is expected to increase in the coming decades due to an ageing and increasingly obese population (Ackerman et al. 2018). OA significantly impacts patients’ quality of life by limiting their normal daily activities and by increasing the risk of further morbidity and of mortality (Corsi et al. 2018). Consequently, it is a heavy burden for people and, in time, will become a more significant healthcare problem (French et al. 2015).

Management of OA currently includes non-pharmacological and pharmacological treatments (French et al. 2015; Tenti et al. 2015). Among the non-pharmacological interventions, the most widely used include balneotherapy, mud therapy and spa therapy in addition to and alternating with other options (i.e. physiotherapy and exercise) (Forestier et al. 2017; Paoloni et al. 2017).

Specifically, balneotherapy is defined as the use of thermal mineral water in which the sum of the cations and anions is greater than 1 g/l, the temperature is not lower than 20 °C and the body is completely immersed (Bender et al. 2005; Branco et al. 2016; Tenti et al. 2015). It is commonly used in many European and Middle Eastern countries with the aim to improve pain and stiffness, strengthen muscle, relieve muscle spasm and maintain or improve functional mobility (Antonelli and Donelli 2018).

Mud therapy utilises a natural product consisting of a mixture of a solid component with a liquid component (mineral or thermal water), which is applied in the form of a wrap, either locally or to the whole body (Fraioli et al. 2018; Paoloni et al. 2017; Tenti et al. 2015). Its application causes vasodilation and increases blood flow, metabolism and connective tissue elasticity, resulting in relief of muscle spasms and pain (Sarsan et al. 2012).

Spa therapy employs several treatment modalities, the most common being the combination of balneotherapy and mud therapy as employed in health resorts (Verhagen et al. 2015).

In the past decades, several clinical studies and reviews have evaluated the efficacy of balneotherapy, mud therapy and spa therapy in the treatment of musculoskeletal disorders (Fraioli et al. 2018; Harzy et al. 2009; Kulisch et al. 2014), but due to poor methodological quality and inadequate statistical analysis, the evidence is still unclear (Verhagen et al. 2015). Furthermore, in most studies, balneotherapy, mud therapy and spa therapy have been combined with other treatments such as exercise programmes, massage and rehabilitation. These multicomponent interventions hindered the possibility of measuring the effectiveness of a single intervention (Falagas et al. 2009).

Despite the wide implementation of a broad spectrum of therapeutic thermal modalities for the management of OA, and several systematic reviews and meta-analyses to evaluate their effectiveness (Antonelli et al. 2018; Beasley et al. 2019; Forestier et al. 2017; Paoloni et al. 2017), to our knowledge, there has been no systematic effort to summarise and critically appraise this body of evidence.

Therefore, we adopted an overview of systematic reviews to combine evidence from a wide range of interventions and outcomes, focussing on evidence from systematic review articles evaluating different thermal modalities for the management of OA.

Specifically, our systematic review addressed the following question: In adults suffering from OA, do balneotherapy, mud therapy and spa therapy lead to a reduction of pain and stiffness and an improvement in quality of life?

Materials and methods

We applied the guidelines for conducting an overview of reviews from the Cochrane Handbook for Systematic Reviews of Interventions (Higgins et al. 2019), and adhered to the preferred reporting items for systematic review and meta-analyses (PRISMA) statement (preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015: elaboration and explanation 2016). The study protocol was registered in the Prospective Register of Systematic Reviews (PROSPERO) and it is publicly available under registration number CRD42019133930.

Criteria for considering reviews for inclusion

Eligibility criteria for the overview were established using the Population, Intervention, Comparator, Outcome and Study design (PICOS) framework, to include the following:

Participants: adults (≥ 18 years) with osteoarthritis (OA).

Interventions and comparators: interventions included balneotherapy, mud therapy and spa therapy versus usual care, placebo or no interventions. Specifically, when we referred to balneotherapy (bathing in natural mineral or thermal/sulphur waters) and mud therapy (mud bath, mud pack/peloid), we considered them as a ‘solitary’ approach outside the spa context. This is because the spa context may have some psychological effects capable of influencing the subjective outcome measurement.

We did not include reviews/trials in which the above interventions were provided in combination with exercise/physiotherapy/training interventions. These co-interventions were only admitted in the trials if the exercises were provided in both branches of the studies (with the same duration /frequency/intensity). We also excluded any trials performed in a specific thermal location with unique environmental conditions (climate, altitude, barometric pressure) that could be confounding factors. Hydrotherapy trials, defined as the use of normal tap water for therapeutic purposes, were also excluded.

Outcome measures: the primary outcomes of interest were pain (VAS, WOMAC scale), stiffness (WOMAC scale) and quality of life (SF36-12, Nottingham Health Profile, Stanford Health Assessment Questionnaire, EQ-5D index). We included these outcomes because the European League Against Rheumatism (EULAR) recommendations (Fernandes et al. 2013; Kloppenburg et al. 2019) consider the control of such symptoms the primary goal of OA management.

Studies: any systematic reviews (SR) of randomised controlled trials (RCT) and non-randomised controlled studies (NRS) used for evaluating the effects of interventions.

SRs were those that were in accordance with the definition proposed by the Cochrane Collaboration’s Handbook (Cumpston et al. 2019).

Information sources and search

We searched PubMed, Scopus, Web of Science, CINAHL, Cochrane Library, PEDro and ProQuest databases from inception until 30 October 2020, with no language restrictions. The complete search strategy is summarised in Online Resource 1.

So as to include other potentially eligible reviews, the lists of references from the retrieved reviews were also examined.

Study selection

Eligible studies were selected using a multi-stage approach (title-abstract, full-text reading) by two independent researchers (LI and DD). Discrepancies were resolved by consensus; where disagreement persisted, it was discussed in detail with a third researcher (DC) until consensus was reached.

Assessment of methodological quality of reviews

Two review authors (LI and FG) independently assessed the included reviews using the AMSTAR2 methodological quality measure tool (Shea et al. 2017). It is an updated version of the original AMSTAR (Shea et al. 2007) tool, specifically developed to assess the methodological quality of systematic reviews that include both randomised and non-randomised studies of healthcare interventions. AMSTAR2 includes the following critical domains: protocol registered before start of review; adequacy of literature search; justification for excluded studies; risk of bias for included studies; appropriateness of meta-analytic methods; consideration of risk of bias when interpreting results; and assessing presence and likely impact of publication bias.

Data collection and analysis

Data were extracted from the full text by one of the authors (DD) and reviewed independently by another (LI).

Data were extracted at two levels, the first regarding the SRs and, the second, the studies included in each SR. The following data were extracted:

SR characteristics: authors, years of publication, research questions, databases searched, year searched, type of studies included, number of studies included, number of participants, interventions/comparator, outcome investigated, main findings.

Overlap among studies (only RCT) included in the SRs: As the degree to which the reviews shared the same RCTs could affect interpretation of results, the overlap between reviews and the number of RCTs that were unique to each review were assessed. An evidence map was prepared for the entire overview and used to calculate the ‘corrected covered area’ (CCA) (Pieper et al. 2014).
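The overlap calculation described above can be sketched as follows. This is a minimal illustration, assuming a citation matrix with one row per primary study and one column per review, and the formula CCA = (N − r)/(rc − r) from Pieper et al. (2014), where N is the number of inclusions counted across all reviews, r the number of unique primary studies and c the number of reviews. The function names and the toy matrix are hypothetical and do not reproduce the evidence map of this overview.

```python
def corrected_covered_area(citation_matrix):
    """Return the CCA as a fraction (0 to 1).

    citation_matrix: list of rows, one per primary study; each row holds
    0/1 flags, one per review (1 = the review included the study).
    """
    r = len(citation_matrix)                        # unique primary studies
    c = len(citation_matrix[0])                     # number of reviews
    n = sum(sum(row) for row in citation_matrix)    # total inclusions
    return (n - r) / (r * c - r)


def overlap_category(cca):
    """Interpretation bands proposed by Pieper et al. (2014)."""
    pct = cca * 100
    if pct <= 5:
        return "slight"
    if pct <= 10:
        return "moderate"
    if pct <= 15:
        return "high"
    return "very high"


# Hypothetical toy evidence map: 4 primary studies across 3 reviews.
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
cca = corrected_covered_area(toy)
print(f"CCA = {cca:.0%} ({overlap_category(cca)})")
```

Subtracting r in both numerator and denominator removes the first (index) inclusion of each study, so that only repeated inclusions count as overlap.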

Data synthesis, analyses and classification of RCTs: A re-analysis of outcome data was planned for this overview, as a substantial difference in analysis results across the systematic reviews, and/or a lack of meta-analysis was expected (Higgins et al. 2019).

The following data were extracted from SRs: RCT sample size, intervention, property, duration and follow-up points. In cases where data on trial characteristics were not available in the included SRs, the missing data were extracted and/or the missing quality assessments were completed independently by two reviewers (LI and DC) using the primary research paper. Quality assessment was performed using the JADAD scale, which describes items pertaining to description of randomisation (2 points), appropriateness of blinding (2 points) and dropouts and withdrawals (1 point) (Jadad et al. 1996).
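The scoring rules of the scale can be illustrated with a short sketch. It assumes the usual reading of Jadad et al. (1996): one point each for mentioning randomisation, mentioning double blinding and describing withdrawals, plus one additional point (or one point deducted) when the randomisation or blinding method is described and judged appropriate (or inappropriate). The function name, argument names and the string labels are hypothetical.

```python
def jadad_score(randomised, randomisation_method,
                double_blind, blinding_method,
                withdrawals_described):
    """Return a Jadad score between 0 and 5.

    randomisation_method / blinding_method take one of:
    "appropriate", "inappropriate" or None (method not described).
    """
    score = 0
    if randomised:
        score += 1
        if randomisation_method == "appropriate":
            score += 1
        elif randomisation_method == "inappropriate":
            score -= 1
    if double_blind:
        score += 1
        if blinding_method == "appropriate":
            score += 1
        elif blinding_method == "inappropriate":
            score -= 1
    if withdrawals_described:
        score += 1
    return score


# Hypothetical trial: adequately randomised, not blinded, dropouts reported.
print(jadad_score(True, "appropriate", False, None, True))
```

A trial of this kind would fall in the 3-to-5 band, which this overview treats as high/moderate quality.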

For quantitative analysis, outcome data were extracted from RCTs and meta-analysed using REVIEW MANAGER 5.3 (The Nordic Cochrane Centre, The Cochrane Collaboration, 2014). In order to conduct the statistical analysis for meta-analysis, sample sizes, means and standard deviations for the experimental and control groups were extracted. Continuous outcomes were expressed using mean differences with 95% CIs. All analyses used random-effects models and, to reduce heterogeneity, sample data were normalised using appropriate scale factors to obtain comparable means and standard deviations. Heterogeneity was assessed using the I2 statistic, and whenever possible, publication bias was assessed using funnel plots. A post hoc sensitivity analysis was conducted excluding studies that may impact the results of meta-analysis.
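The pooling steps described above can be sketched in a few lines. This is an illustrative sketch only, not the Review Manager implementation: it assumes inverse-variance weighting with the DerSimonian-Laird estimate of between-study variance, a linear rescaling of each outcome to a common 0 to 100 range as the "scale factor", and a normal approximation for the 95% CI. All function names and the example values are hypothetical.

```python
import math


def rescale(mean, sd, scale_max, target_max=100.0):
    """Linearly rescale a mean and SD from a 0..scale_max instrument."""
    factor = target_max / scale_max
    return mean * factor, sd * factor


def mean_difference(m_exp, sd_exp, n_exp, m_ctl, sd_ctl, n_ctl):
    """Mean difference (experimental minus control) and its variance."""
    md = m_exp - m_ctl
    var = sd_exp ** 2 / n_exp + sd_ctl ** 2 / n_ctl
    return md, var


def random_effects(effects):
    """Pool (md, variance) pairs with a DerSimonian-Laird random-effects model."""
    k = len(effects)
    w = [1.0 / v for _, v in effects]
    fixed = sum(wi * y for wi, (y, _) in zip(w, effects)) / sum(w)
    # Cochran's Q and the I2 statistic (percentage of variability
    # attributed to heterogeneity rather than chance).
    q = sum(wi * (y - fixed) ** 2 for wi, (y, _) in zip(w, effects))
    df = k - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Between-study variance tau^2, truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for _, v in effects]
    pooled = sum(wi * y for wi, (y, _) in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {"md": pooled, "ci_low": pooled - 1.96 * se,
            "ci_high": pooled + 1.96 * se, "tau2": tau2, "i2": i2}


# Hypothetical trials: a 0-10 VAS rescaled to 0-100, and a 0-100 scale.
m1, s1 = rescale(4.0, 2.0, 10.0)
m2, s2 = rescale(6.0, 2.0, 10.0)
trials = [
    mean_difference(m1, s1, 30, m2, s2, 30),
    mean_difference(35.0, 15.0, 40, 45.0, 15.0, 40),
]
print(random_effects(trials))
```

When the studies disagree strongly, tau² grows and the weights become more equal across trials, which widens the pooled confidence interval relative to a fixed-effect analysis.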

The clinical efficacy outcome as assessed at the short-term follow-up point, which was used in each trial, was taken as the defining moment for identifying the effectiveness of the treatment. When we encountered incomplete data, the authors of the trial were contacted.

For NRSs, qualitative analysis via a narrative approach was used.

Results

Search results

The literature search retrieved 116 unique references, of which 92 were excluded after title and abstract screening. Of the 24 potentially relevant SRs, 7 were excluded after a full-text reading (Online Resource 2). Therefore, 17 SRs were included in this overview. The Study Flow Diagram according to the PRISMA statement is reported in Fig. 1.

Fig. 1

Flow diagram of study selection process

Characteristics of the SRs and their trials

Included SRs were heterogeneous in terms of interventions studied, comparison groups, population, outcomes and follow-up. A detailed description of the included SRs is shown in Table 1.

Table 1 Characteristics of the included reviews

The review question investigated the effect of different thermal modalities on knee joints in 10 SRs, on hand joints in 3 SRs and on any part of the body in 4 SRs. The intervention studied was balneotherapy in 7 SRs (Antonelli et al. 2018; Brosseau et al. 2002; Falagas et al. 2009; Harzy et al. 2009; Katz et al. 2012; Matsumoto et al. 2017; Verhagen et al. 2007) and mud therapy in 5 SRs (Crespin 2017; Espejo-Antunez et al. 2013b; Hou et al. 2020; Liu et al. 2013; Xiang et al. 2016), while 4 SRs researched both modalities (balneo-mud therapy) and their combination within a resort (spa therapy), and one SR (Fraioli et al. 2018) analysed 4 different thermal modalities separately.

The publication dates of SRs ranged from 2002 to 2020. Thirteen SRs (76%) included RCTs only. The number of included studies (RCT/NRS) per SR ranged from 19 (Forestier et al. 2016) to 1 (Katz et al. 2012), and the number of participants from 1612 (Forestier et al. 2016) to 44 (Katz et al. 2012). Only two NRSs were considered eligible.

A detailed look at thermal modalities for all the RCTs included in the meta-analysis is shown in Online Resource 3.

Overlap of RCTs between SRs

After accounting for overlapping RCTs contained within multiple SRs, a total of 25 unique RCTs (about 1780 participants) remained (Online Resource 4). Only 4 (16%) RCTs did not overlap among SRs, giving an overlap percentage of 84%. A total of 25 primary studies were cited 108 times across the 17 SRs included in this overview, resulting in a CCA of 30% and indicating a very high overlap (Table 2).

Table 2 Number of included RCTs that overlapped among reviews

Methodological quality of included reviews and RCTs

The quality assessment of the seventeen SRs is presented in Table 3.

Table 3 Quality assessment of included SRs according to AMSTAR 2

All included reviews had multiple flaws according to the AMSTAR-2 assessment tool.

The quality of the reviews ranged from low (Hou et al. 2020; Matsumoto et al. 2017; Verhagen et al. 2007) to critically low (Antonelli et al. 2018; Beasley et al. 2019; Brosseau et al. 2002; Crespin 2017; Espejo-Antunez et al. 2013b; Falagas et al. 2009; Forestier et al. 2016; Fortunati et al. 2016; Fraioli et al. 2018; Harzy et al. 2009; Katz et al. 2012; Liu et al. 2013; Xiang et al. 2016). Other than the SR by Matsumoto et al. (Matsumoto et al. 2017), all other SRs failed to satisfy critical item 2 (protocol registered), and only the SR by Verhagen et al. (Verhagen et al. 2007) met critical item 7 (list of excluded studies). Thus, all SRs judged critically low failed to satisfy these two specific critical items. The quality of the included RCTs ranged from 2 to 5 (Online Resource 5). A study was considered to be of high/moderate quality if the score was 3 to 5, and of low quality if the score was 1 to 2. The most common reason for point deduction was the absence of double blinding (63%), probably due to the nature of the intervention. Four RCTs (Kovacs et al. 2012; Kovacs and Bender 2002; Tefner et al. 2013; Yurtkuran et al. 2006) were assigned a JADAD score of 5 (highest score), five RCTs (Forestier et al. 2010; Giannitti et al. 2017; Pascarelli et al. 2016; Szucs et al. 1989; Wigler et al. 1995) received a score of 4, eleven RCTs (Balint et al. 2007; Branco et al. 2016; Fioravanti et al. 2015; Fioravanti et al. 2012; Fioravanti et al. 2010; Gungen et al. 2012; Horvath et al. 2012; Karagulle et al. 2007; Mahboob et al. 2009; Odabasi et al. 2008; Sherman et al. 2009) a score of 3 and five RCTs (Espejo-Antunez et al. 2013a; Evcik et al. 2007; Mika et al. 2006; Sarsan et al. 2012; Tishler et al. 2004) scored 2.

Data synthesis and meta-analysis

Balneotherapy

Four eligible RCTs (Horvath et al. 2012; Kovacs et al. 2012; Kovacs and Bender 2002; Tishler et al. 2004) were not included in the analysis due to a lack of data; the remaining seven RCTs were included in the analysis (Fig. 2).

Fig. 2

The effect of balneotherapy on pain (panel a), stiffness (panel b) and QoL (panel c)

Pain: Seven RCTs (499 participants) included in six reviews assessed the effect of balneotherapy on pain (Fig. 2, panel a). Their quality ranged from 2 to 5 (mean 3.7). The results of our meta-analysis indicate that, on average, balneotherapy reduced the pain score by 19.73 when compared with controls (MD = − 19.73; 95% CI − 35.72 to − 3.74; p < 0.02). There was a high degree of statistical heterogeneity across studies (I2 = 99%; p < 0.00001). Funnel plot examination showed asymmetry, suggestive of publication bias in the context of four smaller studies in favour of controls.

Stiffness: Five RCTs (382 participants) included in six reviews addressed the effect of balneotherapy on stiffness (Fig. 2, panel b). Their quality ranged from 3 to 5 (mean 3.4). The results indicate that, on average, balneotherapy reduced the stiffness score by 20.39 when compared with controls (MD = − 20.39; 95% CI − 38.21 to − 2.57; p < 0.02). There was a high degree of statistical heterogeneity across studies (I2 = 98%; p = 0.00001). Funnel plot examination showed symmetry across studies. Sensitivity analysis (Online source….) performed excluding the study by Branco et al. (Branco et al. 2016) did not find any significant differences (MD = − 20.39; 95% CI − 38.21 to − 2.57).

Quality of life: Three RCTs (281 participants) included in two SRs evaluated the effect of balneotherapy on quality of life (Fig. 2, panel c). Their quality ranged from 2 to 5 (mean 3.3). The results showed that balneotherapy improved the quality of life score by 20.48 when compared with controls (MD = − 20.48; 95% CI − 32.44 to − 8.52; p = 0.00008). There was a high degree of statistical heterogeneity across these studies (I2 = 90%; p = 0.00001).

Sensitivity analysis (Online source 6) performed excluding the study by Branco et al. (Branco et al. 2016) reproduced relatively similar point estimates with lower heterogeneity, confirming the significant differences for pain (MD = − 20.39; 95% CI − 38.21 to − 2.57) and stiffness (MD = − 7.7; 95% CI − 12.70 to − 1.35). Nevertheless, in the sensitivity analysis the effect of balneotherapy on quality of life was no longer significant (MD = − 1.55; 95% CI − 12.48 to 9.35).

Mud therapy

Eleven RCTs were included in the analysis (Fig. 3).

Fig. 3

The effect of mud therapy on pain (panel a), stiffness (panel b) and QoL (panel c)

Pain: Eleven RCTs (693 participants), included in nine SRs, evaluated the effect of mud therapy on pain (Fig. 3, panel a). Their quality ranged from 2 to 5 (mean 3.1). The analysis showed a marginally significant difference in favour of the experimental group that received mud therapy compared with controls (MD = − 8.79; 95% CI − 17.33 to − 0.25; p = 0.04), with a high level of heterogeneity (I2 = 87%). Funnel plot examination showed no evidence of publication bias.

Stiffness: Seven RCTs (380 participants), included in ten SRs, assessed mud therapy on stiffness (Fig. 3, panel b). Their quality ranged from 2 to 5 (mean 3.2). Results showed that mud therapy significantly reduced stiffness (MD = − 14.10; 95% CI − 17.87 to − 10.33; p < 0.00001) with moderate heterogeneity (I2 = 42%). There was no evidence of publication bias.

Quality of life: Four RCTs (238 participants), included in eight SRs, assessed mud therapy on quality of life (Fig. 3, panel c), but no statistically significant difference in quality of life was observed between groups (MD = − 0.71; CI − 15.07 to 13.64; p = 0.92). Their quality ranged from 2 to 5 (mean 3).

SPA therapy

One eligible RCT (Wigler et al. 1995) was excluded because of missing data.

Pain: Three RCTs (481 participants), included in six SRs, assessed the effect of spa therapy on pain relief (Fig. 4). Results showed that patients receiving spa therapy experienced less pain compared with a control group (MD = − 11.72; 95% CI − 22.18 to − 1.26; p = 0.03), with substantial heterogeneity (I2 = 83%).

Fig. 4

The effect of SPA therapy on pain

Narrative synthesis of NRS

Balneotherapy

Pain and quality of life: In the study by Gaal et al. (Gaal et al. 2008), included in one SR (Fraioli et al. 2018), the effects of balneotherapy on chronic musculoskeletal pain, functional capacity and quality of life in elderly patients with OA or chronic degenerative low back pain were analysed. The study population consisted of 81 patients (41 with OA) who underwent 15 balneotherapy sessions lasting 30 min and administered daily. A significant decrease in mean disease severity, rated by the patients on a visual analogue scale (VAS), and a significant increase in quality of life score were observed during the period between the two initial visits (p < 0.001). Specifically, the VAS score was 68.53 mm (SD 10.6 mm) at visit 1, 15.63 mm (SD 7.98 mm) at visit 2 and 12.58 mm (SD 7.12 mm) at visit 3, while quality of life score was 216.93 (SD 61.17) at visit 1, 558.78 (SD 150.25) at visit 2 (p < 0.001) and 708.66 (SD 42.29) at visit 3.

Mud therapy

Pain: An NRS by Fraioli et al. (Fraioli et al. 2011), included in 4 SRs, compared an experimental group that received three full cycles of mud bath therapy (12 treatments) over 1 year with a control group that continued with daily pharmacological treatment. The study population consisted of 61 patients with knee OA. After treatment, the mean pain score reported in the experimental group was significantly lower than that reported in the control group (3.53 vs. 5.73; p < 0.001, Student’s t test).

Discussion

This overview represents a systematic, comprehensive and thorough review of the evidence supporting the efficacy of balneotherapy, mud therapy and spa therapy in patients with OA.

The SRs often categorised the interventions into the broad definition of ‘thermal modalities’ using the terms ‘mud therapy’, ‘balneotherapy’ and ‘spa therapy’ interchangeably and in connection with each other (Beasley et al. 2019; Falagas et al. 2009; Forestier et al. 2016; Fortunati et al. 2016; Fraioli et al. 2018; Harzy et al. 2009; Katz et al. 2012; Tenti et al. 2015). In fact, considerable heterogeneity was found in how the SRs classified the thermal modalities. The term ‘balneotherapy’, for instance, was used in both the broad (mineral/thermal water or mud bath/pack) and the strict sense (only mineral/thermal water), the term ‘spa therapy’ was employed to define a combination of interventions (mud pack, along with mineral bath and manual therapy), as well as a single treatment inside or outside a resort.

One implication was that the SRs that focussed on different interventions often included the same primary studies and, consequently, we encountered serious difficulties in classifying the primary studies. For example, the study by Fioravanti et al. (Fioravanti et al. 2010) on spa therapy had been included in SRs addressing balneotherapy (Matsumoto et al. 2017), mud therapy (Espejo-Antunez et al. 2013b; Fraioli et al. 2018; Xiang et al. 2016) and spa therapy (Forestier et al. 2016; Tenti et al. 2015). This resulted in a very high level of overlap of primary studies among the 17 SRs and a possible misclassification.

Therefore, to improve our ability to address the research questions, we collected data directly from the primary studies originally reported by one or more of the 17 included SRs. In an attempt to reduce heterogeneity, the different thermal modalities were combined as specified in the ‘methods’ section of our review.

From a careful re-analysis of RCT data on balneotherapy, a significant reduction of pain and stiffness, and an improvement in quality of life emerged. Although the quality of the RCTs was rated as moderate, the high level of heterogeneity across the studies suggests that the pooled results of balneotherapy on the different outcomes should be interpreted with caution. As noted in other SRs (Matsumoto et al. 2017; Verhagen et al. 2007), we may reasonably believe that the high heterogeneity is a result of the small sample size in the considered RCTs, which were probably underpowered. Another reason for this finding may be that the specific scale indexes and specific measurement standards used in the various scales were inherently different. Furthermore, we should be aware that it is nearly impossible to maintain the exact content/ingredients of the mineral water across trials, as this depends on the country, area and specific location. Of note, sensitivity analysis yielded similar results with lower heterogeneity level for pain and stiffness, while showing no longer significant improvement in quality of life.

Data from RCTs on mud therapy showed a significant reduction in pain and stiffness with high/moderate heterogeneity, while the analysis of quality of life failed to show any significant beneficial effect. Worthy of note is that among the seven RCTs on stiffness, only one RCT applied the double-blind design and all RCTs used a small number of subjects (ranging between 10 and 53 per branch). These factors may have contributed to an overestimate of the efficacy of therapies. It should be noted that each assessor used the same scale (WOMAC) to evaluate stiffness, thus reinforcing the importance of observing common and accepted outcome measures in order to limit heterogeneity in assessment contexts.

The summary measures on spa therapy showed significant pain relief, but high heterogeneity might impair the reliability of the results. The distinction between the effects of thermal applications per se and the benefits that could be derived from a stay in a spa environment is still debated (Bender et al. 2005; Falagas et al. 2009; Fioravanti et al. 2010).

Overall, the high level of heterogeneity in our results is consistent with the conclusions of other SRs (Antonelli et al. 2018; Matsumoto et al. 2017; Xiang et al. 2016). It highlights the need to consider different thermal modalities as separate entities (Bender et al. 2005) and to urge authors to keep them separate when planning interventions in order to make studies more comparable. Although we acknowledge that boundaries between these modalities may be blurred, it is crucial that organisations use some commonly accepted terminology and descriptions of content (Gutenbrunner et al. 2010).

Furthermore, as the quality of SRs ranged from low to critically low, authors should improve SR quality by increasing the use of a priori protocols, and by providing a list of excluded studies with reasons for exclusion. They should also practise transparency in reporting the sources of funding of primary studies included in the review.

The included SRs had several methodological limitations that may have affected confidence in the reported results (Iannone et al. 2020). Heterogeneity of types and characteristics of interventions, even within the same thermal modality (balneotherapy, mud therapy or spa therapy), was the most significant problem that emerged from the present study. To mitigate this problem, we reanalysed data derived from primary studies that had previously been reported in the 17 SRs but that may not have reflected the entirety of the published literature. Furthermore, in some primary studies, the description of the interventions was too vague to provide sufficient understanding for appropriate classification. Even in cases in which the intervention was described in detail, both the duration (intensity and length) and modality used for its delivery varied considerably.

Finally, in an attempt to reduce variability, we decided to analyse only the short-term effect of different thermal modalities, even though the long-term effect is an important factor in determining continuing effectiveness and cost benefits.

Conclusion

Our overview of reviews provided an updated analysis of SRs focussing on different thermal modalities. Overall, there is some encouraging evidence that deserves clinicians’ consideration, suggesting that thermal modalities are effective on a short-term basis for treating patients with OA.

However, the evidence supporting the efficacy of different thermal modalities is limited due to methodological quality and sample size, and to the presence of important treatment variations. The results of our meta-analysis, in particular, should be interpreted with caution, due mainly to the high level of heterogeneity and the absence of a double-blind design. That said, the difficulty in carrying out blind studies is widely known owing to the nature of such interventions.

Further high-quality RCTs are needed to help draw firm conclusions. Research should examine the effects of different thermal modalities while maintaining a clear distinction between them. When possible, the beneficial effect of the spa environment should be treated as a potential confounder or effect modifier, and this should be considered in the study design.