Introduction

Patient-reported outcome (PRO) measurements have become an important tool to evaluate activities, limitations in everyday life, and quality of life in patients with hip disorders. Conventional questionnaires focus on patients with osteoarthritis or those undergoing hip arthroplasty, who have a limited activity level [1–4]. Over the last decade, a better understanding of specific hip pathologies has led to the development of joint-preserving procedures such as hip arthroscopy or surgical hip dislocation [5]. The mainly young and active patients undergoing these procedures have different expectations and aims for their surgery, and conventional outcome tools do not reflect their situation adequately [6–8]. Recently, the Multicenter Arthroscopy of the Hip Outcomes Research Network (MAHORN) study group developed a new PRO questionnaire, the International Hip Outcome Tool (iHOT-33), designed specifically for young, active patients with hip disorders [9]. Compared to other questionnaires, the iHOT-33 includes questions on limitations in social interactions, emotional issues, and working life. The iHOT-33 has shown high reliability and validity. So far, there has been no German version of the iHOT-33.

Studies evaluating measurement properties have to meet a high methodological quality. The COSMIN checklist (COnsensus-based Standards for the selection of health status Measurement INstruments) is a consensus-based checklist to evaluate the methodological quality of studies on measurement properties of health status measurement instruments based on an international Delphi study in 2010 [10].

The purpose of this study was to validate a German version of the iHOT-33 according to the COSMIN checklist and to evaluate the psychometric properties of the iHOT-33 subscales.

Materials and methods

International Hip Outcome Tool (iHOT-33)

In 2012, the Multicenter Arthroscopy of the Hip Outcomes Research Network (MAHORN) developed the International Hip Outcome Tool (iHOT-33), a self-administered questionnaire comprising 33 items [9]. The questionnaire consists of four subscales covering symptoms/function, sports activity, social limitations, and occupational limitations. The patient is asked to consider the problems of the past month and to indicate their severity on a 100-mm horizontal line by marking it with a slash. Each question has equal weight, so the overall score is the mean of all questions and ranges from 0 to 100. A score of 100 indicates full function and no symptoms, whereas a score of zero signifies maximum limitations and extreme symptoms. The iHOT-33 has shown high internal consistency, construct validity, and responsiveness [8, 9, 11–13].
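The scoring rule described above can be sketched in a few lines. This is only an illustration: the function name `ihot33_score` is hypothetical, and it assumes that each mark has already been measured in millimeters in the direction where 100 mm means no limitation.

```python
def ihot33_score(marks_mm):
    """Overall score: unweighted mean of the 33 item marks (0-100 mm each).

    Assumption: every mark is measured so that 100 mm is the best
    possible answer and 0 mm the worst.
    """
    if len(marks_mm) != 33:
        raise ValueError("expected 33 item responses")
    if any(not 0 <= m <= 100 for m in marks_mm):
        raise ValueError("each mark must lie between 0 and 100 mm")
    # Equal weighting: the score is simply the arithmetic mean.
    return sum(marks_mm) / len(marks_mm)
```

Incomplete questionnaires raise an error here, which mirrors the decision in this study to exclude questionnaires with missing items instead of imputing them.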

Adaptation of the iHOT-33

The translation of the iHOT-33 into German was carried out following the guidelines of the American Academy of Orthopedic Surgeons (AAOS) Outcomes Committee [14]. According to these guidelines, an informed and an uninformed translator independently translated the iHOT-33 from English into German. After consolidation of both translations, a German linguist reviewed the German version of the questionnaire. Two native-speaking translators (one informed, one uninformed) translated this German version back into English. The back-translation was checked for consistency with the original. Finally, the German questionnaire was tested for comprehensibility in 20 patients with a hip disorder. The translation process was supervised and documented in a survey report.

Validation study

A prospective multicenter clinical trial was performed to evaluate reliability, validity, and responsiveness of the German version of the iHOT-33. Inclusion criteria were a history of a hip disorder, a score of ≥4 on a modified Tegner Activity Scale [15], and sufficient reading and comprehension capacity. Exclusion criteria were another disorder of the back or the contralateral lower extremity, a score of less than 4 on the modified Tegner Activity Scale, a mental disorder, or a lack of informed consent to participation [9]. All patients were seen in an outpatient setting. The patients completed the first questionnaire before seeing the orthopedist. For evaluation of test–retest reliability, the patients completed a second questionnaire after a minimum of 2 weeks. The patients were asked to answer the questions according to their current status and to return the forms by mail. Patients who did not respond within 6 weeks were reminded by telephone. All patients gave their written informed consent to participate in this study. The local ethics committee approved the study in November 2013.

Questionnaire

In addition to the G-iHOT-33, the questionnaire consisted of the following scores.

Hip Outcome Score (HOS)

The HOS is an established 31-item PRO tool to evaluate activities, limitations in everyday life, and quality of life of patients with a hip disorder. It comprises two subscales, one on activities of daily living and one on sports activities. The patient is asked to answer the questions considering the past week. Scores range from 0 to 100; higher scores represent better function and a higher level of activity [16]. The German version of the HOS was validated and published in 2011 [17].

Modified Tegner Activity Scale (mTAS)

The TAS is a 10-level activity scale reflecting the patient’s currently highest level of sports activity or other routine activities. It was initially designed as a complement to other functional scores of the knee joint and is the most commonly used activity scoring tool [15]. Although there is no validation study of the hip modification of the TAS, it is also well established and widely used [9, 13, 18, 19]. A score greater than 4 was an inclusion criterion for the development study of the iHOT-33 by Mohtadi [9]. Hence, we included the scale in our questionnaire to achieve a similar cohort.

EuroQol-5D (EQ-5D)

The EQ-5D is a global quality of life questionnaire consisting of a 5-item assessment of the health status regarding mobility, self-care, usual activities, pain/discomfort, and anxiety/depression [20]. The second part of the EQ-5D consists of a 200-mm analog scale on which the patient rates the current global health status. The EQ-5D has been adapted to German and is validated for a number of health-compromising conditions [21].

Subjective assessment

The patients were asked to assess their current limitations concerning function (pain, ROM, etc.), sport/leisure activities, employment/housekeeping, and social interaction/quality of life. The limitations were to be estimated in percent, from 0 % = no limitation at all to 100 % = maximum limitation.

The second set of questions also included an evaluation of whether the condition of their hip joint was ‘much better’, ‘somewhat better’, ‘unchanged’, ‘somewhat worse’, or ‘much worse’ compared to the primary evaluation.

Statistical analysis

Questionnaires with any missing data or unclear marking were excluded from the analysis. Statistical analysis was performed using the software package SPSS (Version 22, SPSS Inc., Chicago, Illinois). Unless otherwise stated, descriptive data are given as mean ± standard deviation. The level of significance was defined at p < 0.05 for all tests.

Methodological testing according to the COSMIN checklist

Reliability

Reliability is the degree to which the measurement is free from measurement error [22]. To evaluate reliability, internal consistency, test–retest reliability, and measurement error are calculated.

Internal consistency

Internal consistency is described as the degree of interrelatedness among items [22]. Cronbach’s alpha was calculated in total and for each subscale separately. Sufficient internal consistency was assumed for a Cronbach’s α greater than 0.7 [23].
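For readers who want to reproduce the internal consistency check, the following is a minimal sketch of Cronbach's α in its standard form, α = k/(k − 1) × (1 − Σ item variances / variance of the total score). The analysis in this study was run in SPSS; the function name `cronbach_alpha` and the data layout (one list of item scores per patient) are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete data.

    rows: one list of item scores per patient, all of equal length.
    Uses sample variances (n - 1 in the denominator) throughout.
    """
    k = len(rows[0])  # number of items
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def is_sufficient(alpha, cutoff=0.7):
    """Cutoff of 0.7 as applied in the text [23]."""
    return alpha > cutoff
```

The same function can be applied to the columns of a single subscale to obtain subscale-level values.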

Test–retest reliability

Test–retest reliability is the extent to which results of the same patient in the same health condition remain unchanged over time [22]. According to the recommendation of the COSMIN manual, the retest was performed at least 2 weeks after the outpatient consultation to avoid recollection of the answers while limiting changes in health condition. Intraclass correlation coefficients (ICC) were calculated for all patients who indicated an unchanged condition of their hip joint since the primary evaluation. Sufficient test–retest reliability was assumed for an ICC greater than 0.7 [23].
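The text does not state which ICC model was used. As an illustration only, the sketch below computes ICC(2,1), a two-way random-effects, absolute-agreement, single-measurement coefficient, from the usual ANOVA mean squares. The choice of model and the function name `icc_2_1` are assumptions, not a description of the SPSS settings used in this study.

```python
def icc_2_1(test, retest):
    """ICC(2,1) for paired test and retest scores of the same patients.

    Assumption: two-way random effects, absolute agreement, single measure.
    """
    n, k = len(test), 2
    rows = list(zip(test, retest))
    grand = (sum(test) + sum(retest)) / (n * k)
    row_means = [(a + b) / k for a, b in rows]
    col_means = [sum(test) / n, sum(retest) / n]

    # Mean squares for patients (rows), occasions (columns), and residual.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (x - row_means[i] - col_means[j] + grand) ** 2
        for i, row in enumerate(rows)
        for j, x in enumerate(row)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

An absolute-agreement model penalizes a systematic shift between test and retest, which a consistency-type ICC would ignore.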

Measurement error

The measurement error is the systematic and random error of a patient’s score that is not attributed to true changes in the construct to be measured [22]. The standard error of measurement (SEM) was calculated as \({\text{SEM}} = {\text{SD}} \times \sqrt{1 - {\text{ICC}}}\) (SD = standard deviation; ICC = intraclass correlation coefficient) [23]. The smallest detectable change (SDC) reflects the smallest individual change in score that can be interpreted as a real change. It was calculated as \({\text{SDC}} = {\text{SEM}} \times 1.96 \times \sqrt{2}/\sqrt{n}\) (n = number of patients) [23].
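As a worked illustration of these two quantities, the sketch below uses the standard form SEM = SD × √(1 − ICC) and the SDC expression that includes the division by √n. The function names are hypothetical, and the numeric inputs in the usage lines are the SEM and sample size reported in the Results section.

```python
import math


def sem(sd, icc):
    """Standard error of measurement: SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)


def sdc(sem_value, n=1):
    """Smallest detectable change: SEM * 1.96 * sqrt(2) / sqrt(n).

    With n = 1 the divisor drops out and the value applies to a single
    patient; with n > 1 it shrinks with the square root of the sample size.
    """
    return sem_value * 1.96 * math.sqrt(2) / math.sqrt(n)


# Values reported in the Results section: SEM = 8.9, 83 patients.
print(round(sdc(8.9, n=83), 1))  # 2.7
print(round(sdc(8.9), 1))        # 24.7 (same SEM without the sqrt(n) divisor)
```

The second line is included only to show how strongly the √n divisor affects the result.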

Validity

Validity is the degree to which a questionnaire measures the construct it purports to measure [22].

Construct validity

Since there is no gold standard in the measurement of PRO, validity was rated as construct validity. Construct validity is the degree to which the scores of a questionnaire are consistent with those of questionnaires measuring the same construct. To validate the G-iHOT-33, Pearson’s correlation coefficient was calculated (1) between the iHOT-33 subscales and a subjective rating by the patient, (2) between iHOT-33 subscales 1 and 2 (symptoms/function and sports activity) and the HOS subscales, and (3) between the social/emotional subscale of the iHOT-33 (No. 4) and the EQ-5D score. A correlation coefficient of r < 0.3 was considered poor and r > 0.7 good.
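The correlation analysis and the two cutoffs can be sketched as follows. The helper names are hypothetical, the label "moderate" for values between the two cutoffs is an assumption (the text names only "poor" and "good"), and applying the cutoffs to the absolute value of r is likewise an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rate_correlation(r):
    """Classify r with the cutoffs used in the text (0.3 and 0.7)."""
    if abs(r) > 0.7:
        return "good"
    if abs(r) < 0.3:
        return "poor"
    return "moderate"  # label assumed; not named in the text
```

Applied to the coefficients reported later in this paper, `rate_correlation(0.07)` yields "poor" and `rate_correlation(0.83)` yields "good".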

Hypothesis testing

To analyze construct validity, we tested a priori hypotheses [22]. We hypothesized that the G-iHOT33 would correlate well with the other subjective scales like the HOS and the EQ-5D. Therefore, a Pearson correlation coefficient r > 0.7 was expected. Low correlations r < 0.3 were expected between the G-iHOT33 and the mTAS. The subjective global rating of change (GRC) was correlated with the mean difference between the G-iHOT33 scores at T2–T1. We hypothesized that the changes in the G-iHOT33 score would correlate with the subjective evaluation of the patient [19, 24].

Responsiveness

Responsiveness is the ability of a questionnaire to detect a change over time in the construct to be measured [22]. According to Terwee et al. [23], responsiveness was demonstrated by comparing the smallest detectable change (SDC) to the minimal important change (MIC). Responsiveness was confirmed if the SDC < MIC.

Interpretability

Interpretability is the ability to transform a qualitative effect into a quantitative score [22]. The minimal important change (MIC) was estimated by dividing the standard deviation (SD) by two, as described by Norman et al. [25]. In addition, we used an anchor-based method to evaluate responsiveness. At T2, the patients were asked to rate whether the current condition of their hip joint was ‘much better’, ‘somewhat better’, ‘unchanged’, ‘somewhat worse’, or ‘much worse’ compared to the condition at the primary evaluation. The effect size (ES) was calculated as the mean change in score divided by the SD. The 95 % CI of the effect size of the ‘somewhat better’ group was compared to the ES of the ‘unchanged’ group to estimate the true MIC (Fig. 1).

Fig. 1 Estimation of the effect size
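The distribution-based MIC estimate and the effect size can be illustrated with a short sketch. The text does not say which SD enters the effect size; using the SD of the baseline (T1) scores is an assumption here, and the function names and toy scores are invented for illustration.

```python
from statistics import mean, stdev


def mic_half_sd(baseline_scores):
    """Distribution-based MIC estimate: half the SD of the scores [25]."""
    return stdev(baseline_scores) / 2


def effect_size(baseline_scores, followup_scores):
    """Mean change (T2 - T1) divided by the SD of the baseline scores.

    Assumption: the baseline SD is used as the denominator.
    """
    changes = [t2 - t1 for t1, t2 in zip(baseline_scores, followup_scores)]
    return mean(changes) / stdev(baseline_scores)


# Toy data, not study data: three patients who each improve by 6 points.
t1 = [40, 50, 60]
t2 = [46, 56, 66]
print(mic_half_sd(t1))                # 5.0
print(round(effect_size(t1, t2), 2))  # 0.6
```

For the anchor-based comparison, the same effect size would be computed separately for each global rating of change group.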

Another quality criterion for content validity is the absence of floor and ceiling effects. If more than 15 % of patients score highest (100) or lowest (0) value in the G-iHOT33, extreme outcome values might not be represented adequately [23].
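The 15 % criterion can be checked with a few lines of code. The function name `floor_ceiling_effects` and the returned dictionary layout are hypothetical; the usage example mirrors the situation reported in the Results section, where 5 of 83 patients scored 100 and none scored 0 (the remaining scores are placeholders).

```python
def floor_ceiling_effects(scores, threshold=0.15, lowest=0, highest=100):
    """Share of patients at the lowest and highest possible score.

    An effect is flagged when the share exceeds the threshold (15 %) [23].
    """
    n = len(scores)
    floor_share = sum(s == lowest for s in scores) / n
    ceiling_share = sum(s == highest for s in scores) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# 5 of 83 patients at the maximum, nobody at zero; other scores are placeholders.
result = floor_ceiling_effects([100] * 5 + [50] * 78)
print(round(result["ceiling_share"] * 100))  # 6
print(result["ceiling_effect"])              # False
```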

Results

Demographic data and generalizability

Between December 2013 and December 2014, eighty-three patients completed both questionnaires and were available for data analysis. The cohort comprised 24 women (29 %) and 59 men (71 %). The mean age was 33.7 ± 11.8 years (range 14–63). Demographic data and diagnosis-related score results are provided in Table 1. The second questionnaire was completed on average 28.5 ± 31.7 days (range 14–194) after the first. Missing items were found in 92 of 5146 items in total (1.78 %). Questionnaires containing missing items or unclear marking were excluded from the analysis. Missing items occurred randomly; there was no accumulation of missing items in any part of the questionnaire.

Table 1 Demographic data and diagnosis-related score results

Reliability

A Cronbach’s α of 0.97 (95 % CI 0.96, 0.99) showed excellent internal consistency for the G-iHOT-33. Internal consistency was also appropriate for each subscale, with Cronbach’s α values between 0.88 and 0.96. The intraclass correlation coefficient (ICC) was 0.88 (95 % CI 0.80, 0.93) for all patients who indicated an unchanged condition of their hip joint since their primary evaluation. The overall SEM was 8.9. Hence, the smallest detectable change (SDC), reflecting the smallest individual change in score that can be interpreted as a real change, was 2.7.

Validity

The assessment of construct validity showed a good correlation between the subscales of the G-iHOT-33, the HOS, and the EQ-5D (Table 2). With a Pearson correlation coefficient of 0.07, there was only a poor correlation between the G-iHOT-33 and the mTAS. Therefore, all hypotheses could be confirmed. Adequate responsiveness of the G-iHOT-33 was demonstrated by a higher value of the MIC (12.7) compared to the SDC (2.7). According to the GRC-dependent 95 % CI of the ES, the estimation of the ES was confirmed (Fig. 1). In addition, there was a good correlation between the global rating of change (GRC) and the mean difference between the G-iHOT-33 scores at T2–T1 (Table 3; Fig. 2). Based on an ES of 0.6, the minimal important change, i.e., the change that reflects a clinically relevant improvement, is approximately ten points on the iHOT-33.

Table 2 Pearson correlation coefficients between functional scores
Table 3 Change sensitivity of the iHOT33 based on global rating of change
Fig. 2 Mean difference of the iHOT33 for GRC

There was no floor effect for the G-iHOT-33, since no patient had a score of zero. Only five patients (6 %) reached the maximum score of 100, so relevant ceiling effects could be ruled out.

The Pearson correlation coefficients between the subscales concerning symptoms/function, sports activity, social, and occupational limitations and the subjective rating are shown in Table 4. We found a good correlation between the iHOT-33 subscales and the HOS subscales as well as the patient’s subjective rating for each of the subscales. The highest correlation coefficient of r = 0.83 was found between the social subscale and the EQ-5D. None of the subscales showed floor or ceiling effects.

Table 4 Subscale analysis for G-iHOT33

Discussion

The present study evaluated the German version of the iHOT33. This German version of the iHOT33 provides sufficient validity, reliability, and responsiveness for the evaluation of physically active patients with non-arthritic hip problems. In addition, we are able to present results on psychometric properties of the iHOT33 subscales on symptoms/function, sports activity, social, and occupational limitations separately. This is the first study validating the 33-item iHOT following the complete COSMIN checklist.

Since femoro-acetabular impingement syndrome (FAI) has been identified as a risk factor for osteoarthritis of the hip joint, the development of joint-preserving procedures has advanced over the last decade. Due to technical improvements, hip arthroscopy has become a successful procedure to relieve pain and to restore clinical function in FAI [5, 8, 26–29]. Patient-reported outcome tools are becoming more and more important to reflect the patient’s view of the postoperative outcome and limitations in everyday life. For patients with hip fractures, osteoarthritis, or those undergoing hip replacement, several established PRO instruments already exist. Recently, some questionnaires were developed to evaluate the postoperative outcome in the cohort of young, physically active patients. These questionnaires mainly focus on symptoms and function in everyday life [24]. Most of them, however, do not give a comprehensive picture of the patients’ views. Some authors have already pointed out a discrepancy between functional results and patient satisfaction in patients undergoing hip arthroscopy [5–7, 30, 31]. Social, emotional, and occupational factors might also play an important role in the patients’ assessment of the therapy. Therefore, the Multicenter Arthroscopy of the Hip Outcomes Research Network (MAHORN) has recently developed an outcome measurement instrument including questions on social, emotional, and occupational limitations. The iHOT has also been cross-culturally adapted into Spanish, Portuguese, and Swedish [32–34]. Recent comparative studies have shown good results for most of the psychometric properties of the iHOT-33 [4, 9, 12, 18, 33]. Our study also showed a high level of reliability and validity for the German version of the iHOT-33. However, some authors have criticized the methodological quality of the development study of the iHOT-33 and advised a separate validation of the iHOT-33 subscales [4, 13, 35, 36].

Study design and population

Our demographic data are comparable to those of other studies on young, active patients, most of whom underwent hip arthroscopy, with an average age of around 35 years [6–8, 12, 29, 30]. To obtain a more heterogeneous patient sample, we did not preselect patients according to their diagnosis or intended treatment. Aiming for validation data for various hip diseases, we included all patients with a hip disorder and an activity level greater than 4 on the Tegner activity scale. Compared to the original publication of the iHOT-33 by Mohtadi [9], we chose more liberal inclusion criteria and did not apply an age limit. We nonetheless excluded patients with a disorder of the back or the contralateral lower extremity or a mental disorder to avoid confounding. The number of patients included in our study is in accordance with previous recommendations [23].

The translation process was conducted according to the guidelines of the American Academy of Orthopedic Surgeons (AAOS) Outcomes Committee [14]. The period of time between test and retest was chosen to be a minimum of 2 weeks as recommended in the COSMIN checklist [22]. The validation was carried out following the complete COSMIN checklist [10, 22]. Along with the prospective multicenter design, the study meets high methodological standards with a level of evidence Ib.

Reliability

The excellent Cronbach’s alpha underlines the quality of the German iHOT-33 and confirms the results of prior validation studies on the iHOT-33 [9, 12, 18, 33]. Likewise, an ICC of 0.88 confirmed good test–retest reliability. Low values for measurement error and smallest detectable change (SDC) indicate that small clinical changes can be detected not only at group level but also at the individual level [23]. Harris-Hayes et al. [13] contended that one current limitation of the iHOT-33 is that the subscales concerning symptoms/function, sports activity, social, and occupational issues were not validated separately. The present study is the first to report a Cronbach’s alpha for each subscale and therefore confirms a positive rating for internal consistency [9, 12, 18, 33].

Validity

For the evaluation of construct validity, the HOS and EQ-5D seemed most appropriate because they are applicable in this patient population and validated in the German language [17, 20, 21]. The HOS was also developed for patients undergoing hip arthroscopy [17]. Therefore, it seemed a viable instrument to evaluate construct validity. We ensured a sufficient level of activity by including only patients with an activity level greater than four on the modified Tegner Activity Scale (mTAS) as described by Mohtadi et al. [9]. Because the mTAS is solely derived from the level of sports activity, it is rather robust to smaller changes of the medical condition. Therefore, we expected rather low correlations between the functional hip scores and the mTAS.

Responsiveness and Interpretability

In our population, the minimal important change (MIC) according to the method of Norman et al. [25] was 12.7. This indicates a reasonable ability to transform a qualitative effect into a quantitative score while providing a high discriminatory power. The use of this method is controversial in the literature [4, 7, 19, 37, 38]. Norman et al. [25] suggested estimating the MIC by dividing the standard deviation by two. This method was derived from the effect size observed after clinical intervention. Based on Terwee et al. [23], a positive rating for responsiveness can be assumed when the SDC is smaller than the MIC. With an anchor-based method, patients report a current state of health measured against their personal expectations [4, 37, 38]. Therefore, calculation of the MIC can be problematic in the absence of a therapeutic gold standard. The COSMIN checklist does not contain any recommendations on the estimation of the MIC, and there is still no consensus about how to assess it. Although the 95 % CIs of the ‘unchanged’ and ‘somewhat better’ groups are wide, a remarkable difference exists only within a small range of 0.21–0.63. The cutoff point that separates the ‘unchanged’ group from the ‘somewhat better’ group has to be outside of the 95 % CI of the ‘unchanged’ group. It should also represent the smallest acceptable effect size for the ‘somewhat better’ group. Accordingly, an estimated ES of around 0.6 seems reasonable to approximate the true ES of the iHOT-33. Consequently, the minimal important change that reflects a clinically relevant improvement is approximately ten points on the iHOT-33. Defining the MIC of the iHOT-33 is a main aspect of this study.

Our findings according to the global rating of change also confirm a strong correlation between patient perception and score result of the G-iHOT33. Nevertheless, a change of “somewhat better” or “somewhat worse” was not statistically significant compared to a steady state.

None of the iHOT33 subscales showed floor- or ceiling-effects. These results are concordant with previous results on interpretability of the iHOT33 [4, 9, 12, 18, 33].

Limitations

Despite good results concerning validity, reliability, and responsiveness of the G-iHOT33, there are some limitations to this study. Firstly, there was no additional score to correlate with the occupational subscale of the iHOT33. Therefore, the occupational subscale could only be correlated with the patient’s subjective rating showing reasonable correlation.

Another limitation is that due to a lack of a therapeutic gold standard, the minimal important change could only be estimated. To date, the only study providing longitudinal results on the iHOT is Mas Martinez et al. [8]. They report on short-term outcomes after hip arthroscopy in FAI with a minimum follow-up of 12 months. Unfortunately, data concerning minimal clinically important change are not specified in this study. Further prospective studies on longitudinal measurement properties of the iHOT33 are needed.

Conclusion

The German version of the iHOT33 provides good validity, reliability, and responsiveness for the functional evaluation of physically active patients with a hip disorder. This is the first study to present validation results for the iHOT33 subscales. Another main aspect of this study is the definition of the minimal important change (MIC) which is 10 points on the iHOT-33 scale. The COSMIN checklist is a feasible guideline to assess psychometric properties of patient-reported outcome measurements.