Introduction

Pelvic organ prolapse (POP) is defined as the symptomatic descent of one or more of the anterior vaginal wall, the posterior vaginal wall, and the apex of the vagina (uterus or vault) [1]. Other pelvic floor dysfunctions (PFD), such as lower urinary tract, bowel and sexual dysfunctions, often coexist with POP. POP and other PFD affect a substantial proportion of women [2], often cause bothersome symptoms and can have a negative effect on psychological and social wellbeing [3]. To better understand a patient’s condition and the effect of treatment, patient-reported outcomes such as condition-specific health-related quality of life (HRQoL) are often assessed [3].

Two common instruments used for this purpose are the Pelvic Floor Distress Inventory and the Pelvic Floor Impact Questionnaire [3], for which abbreviated 20-item (PFDI-20) and 7-item (PFIQ-7) versions, respectively, have been validated to reduce the burden on participants [4]. Both have been tested for reliability, validity and responsiveness to change against their original longer counterparts, and demonstrated moderate to excellent associations [4, 5]. The PFDI-20 assesses the presence of symptoms and bother in three domains (POP, bowel and urinary), and the PFIQ-7 assesses the impact on HRQoL in these domains. Both the PFDI-20 and PFIQ-7 are designed to evaluate the efficacy of therapy and have been shown to discriminate between women with and without improvement following treatment [4, 5].

The PFDI-20 and PFIQ-7 are highly recommended (grade A) [6] and have been validated in several languages [7–9], but there are as yet no Norwegian versions. Therefore, the aims of the current study were to translate the PFDI-20 and PFIQ-7 into Norwegian and to test their measurement properties (reliability, validity and responsiveness to change) in a prospective longitudinal study of women with POP and PFD in the tertiary setting.

Materials and methods

Ethics approval

Approval was granted by the regional committees for Medical and Health Research Ethics (Norway) and the Flinders University Social and Behavioural Research Ethics Committee (Australia). Permission was also granted by the developer of both instruments. Written informed consent was obtained from all participants.

Translation and cultural adaptation

The PFDI-20 and PFIQ-7 were first translated from English into Norwegian using a multistep translation and cultural adaptation method. This new method combined the European Organization for Research and Treatment of Cancer (EORTC) Quality of Life Group Guidelines [10], the Delphi method [11, 12] and expert panel review [13]. It involved two independent forward and back translations [10], with the addition of the Delphi method [11, 12] (i.e. anonymous voting, controlled feedback, and statistical group response), to establish consensus on the translated items among a panel of bilingual pelvic floor experts comprising gynaecologists, colorectal surgeons, a urologist, a physiotherapist and a urotherapist [13]. The translated instruments were then pilot tested for comprehensibility, readability and equivalence through face-to-face semistructured interviews with 20 women with POP (with or without urinary or bowel dysfunction). Minor discrepancies were identified and amended, resulting in comprehensible Norwegian versions of the PFDI-20 and PFIQ-7 with a readability level corresponding to a reading age of 12 years. These are included in Appendices 1 and 2.

Participants and procedure

Participants were patients recruited through the Department of Obstetrics and Gynaecology of Akershus University Hospital, Norway, from June 2014 to September 2015. Two cohorts were included: those with POP (nonsurgical patients), and those undergoing surgery for POP (surgical patients; Table 1). For inclusion, nonsurgical patients had to be referred to the Outpatient Department with symptomatic POP (with or without urinary or bowel dysfunction), while surgical patients additionally had to have anatomic POP of POP Quantification (POP-Q) [14] stage 2 – 4 and to be scheduled for vaginal repair.

Table 1 Baseline characteristics of the participants and summary statistics for key study variables

Exclusion criteria were age less than 18 years, inability to understand Norwegian and/or complete a patient-reported outcome questionnaire, and visual impairment. The sample size was based on the Consensus-based Standards for the Selection of Health Measurement Instruments (COSMIN) recommendations of a minimum of 50 participants for every subgroup analysis [15]. The exception was internal consistency, for which the sample size was based on a subject-to-item ratio of at least 4:1 (a minimum of 108 participants for the 27 items of the two instruments combined) [15].

The participants completed the PFDI-20, PFIQ-7 and the SF-36v2 Norwegian Health Survey (SF36) [16] at baseline (T0), and a subsample completed the questionnaires 1 – 3 weeks later (T1). This interval was chosen on the assumption that it would be short enough for the participants’ POP condition to remain unchanged, but long enough to ensure that they would not recall their T0 responses. Patients scheduled for POP surgery also completed the questionnaires 6 months after surgery (T2). At T0 participants provided sociodemographic data (age, gender, parity), body mass index and previous surgery data as sample descriptors. A POP-Q examination was performed at both T0 and T2. Figure 1 shows a flow chart of patient recruitment and participation.

Fig. 1 Flow chart of patient recruitment and participation

Measurement instruments

The 20-item PFDI-20 measures symptom distress during the past 3 months. Responses are given on a scale ranging from 0 (‘no’) to 4 (‘yes, quite a bit’) [5]. Three subscales are also available: the Urinary Distress Inventory (UDI-6), the Pelvic Organ Prolapse Distress Inventory (POPDI-6), and the Colorectal–Anal Distress Inventory (CRADI-8). The total score is converted to a range of 0 – 300, and the subscales are scored 0 – 100. In all cases higher scores indicate greater distress. The seven-item PFIQ-7 measures HRQoL issues in women with PFD (including daily physical/social activity, travel, and emotional health) during the past 3 months. Responses are given on a scale ranging from 0 (‘not at all’) to 3 (‘quite a bit’). The PFIQ-7 also has three subscales: the Urinary Impact Questionnaire (UIQ-7), the Pelvic Organ Prolapse Impact Questionnaire (POPIQ-7), and the Colorectal–Anal Impact Questionnaire (CRAIQ-7) [5]. Again, the total score is converted to a range of 0 – 300, and the subscales are scored 0 – 100. Higher scores indicate greater symptom distress and impact on the patient’s HRQoL [5]. The SF36 is a multipurpose generic health outcome measure comprising 36 items. For the current study, only the Physical Component Summary (PCS) score and the Mental Component Summary (MCS) score are reported. For both PCS and MCS, lower scores indicate poorer health [17].
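The scoring described above can be sketched in Python. This is a minimal illustration rather than the instruments' official scoring code. It assumes the commonly used rule of rescaling the mean of the answered items in each subscale to 0 – 100 and summing the three subscales for the total, and it assumes that items 1 – 6, 7 – 14 and 15 – 20 of the PFDI-20 form the POPDI-6, CRADI-8 and UDI-6, respectively. The function names and the use of None for a missing item are hypothetical.

```python
from statistics import mean

# Assumed item layout (not stated in the text): 0-based positions per subscale.
PFDI20_SUBSCALES = {
    "POPDI-6": range(0, 6),
    "CRADI-8": range(6, 14),
    "UDI-6": range(14, 20),
}


def subscale_score(responses, max_response):
    """Rescale the mean of the answered items to 0-100 (None = missing item)."""
    answered = [r for r in responses if r is not None]
    if not answered:
        return None  # nothing answered, so no score can be derived
    return mean(answered) * 100 / max_response


def score_pfdi20(responses):
    """Return (subscale scores, total score) for 20 responses coded 0-4."""
    if len(responses) != 20:
        raise ValueError("PFDI-20 needs exactly 20 responses")
    subscales = {
        name: subscale_score([responses[i] for i in positions], max_response=4)
        for name, positions in PFDI20_SUBSCALES.items()
    }
    if any(score is None for score in subscales.values()):
        return subscales, None  # total undefined if a subscale is empty
    return subscales, sum(subscales.values())
```

The same helper would cover the PFIQ-7 subscales by passing max_response=3, since those items are coded 0 – 3.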

At retest (T1), participants were asked if their condition had changed during the interim period [18] with the question ‘Compared to the first time you completed the questionnaires, has your prolapse condition changed?’ (women who answered ‘Yes’ were excluded from the retest). At T2, participants were also asked ‘In general, how much did the treatment improve your pelvic organ prolapse?’ (global rating of change, GRC). Responses were given on a six-point scale from ‘improved significantly’ to ‘no significant improvement’ [18].

Statistical methods

Analyses were conducted using SPSS version 22.0 (IBM Corp., Armonk, NY). Statistical significance was assumed at p < 0.05. COSMIN recommendations were used as a guide for evaluating the measurement properties of the Norwegian PFDI-20 and PFIQ-7 [19, 20]. First, floor and ceiling effects were examined and considered problematic if more than 15 % of participants achieved the highest or lowest possible score [15]. Missing data at the item level were also noted. Based on COSMIN recommendations, <3 % is acceptable and >15 % is unacceptable [18]. Cronbach’s alpha was calculated for PFDI-20 and PFIQ-7 scores as a measure of internal consistency (the degree of interrelatedness among the items [20]). A value of 0.70 or greater is considered to indicate adequate internal consistency [15, 21].
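As an illustration of the two checks described above, the following Python sketch computes the share of respondents at the lowest and highest possible score and Cronbach's alpha from item-level data. The study itself used SPSS. The function names are hypothetical, the alpha formula is the standard one based on sample variances, and the code assumes complete cases only.

```python
from statistics import variance


def floor_ceiling(scores, lowest, highest, threshold=0.15):
    """Share of respondents at the scale limits, flagged if above 15 %."""
    n = len(scores)
    floor_share = sum(s == lowest for s in scores) / n
    ceiling_share = sum(s == highest for s in scores) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_problem": floor_share > threshold,
        "ceiling_problem": ceiling_share > threshold,
    }


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])  # number of items
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

A value returned by cronbach_alpha would then be compared with the 0.70 criterion stated above.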

Test–retest reliability (the degree to which a measurement is free from error [20]) was evaluated using intraclass correlation coefficients (ICCs) to quantify the agreement between PFDI-20 and PFIQ-7 scores [15, 19]. ICCs were calculated according to the method of McGraw and Wong [15]. Coefficients of at least 0.70 are considered adequate [15, 18]. Measurement error (the systematic and random error of a patient’s score that cannot be attributed to true changes in the construct being measured) [20] was also assessed. It is considered acceptable when the smallest detectable change (SDC; 1.96 × √2 × SEM, where SEM is the standard error of measurement) is smaller than the minimal important change (MIC) [15]. SEM was calculated as the square root of the variance from analysis of variance, including systematic differences (SEM agreement) [15].
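The quantities defined above can be illustrated with a short Python sketch for paired T0 and T1 scores. The text does not state which ICC form was used, so this sketch assumes a two-way model with absolute agreement and a single measurement, and it derives SEM agreement from the occasion and residual variance components of a two-way analysis of variance. The function name and these modelling choices are assumptions, not the authors' SPSS procedure.

```python
from math import sqrt


def retest_agreement(scores_t0, scores_t1):
    """Return (ICC, SEM_agreement, SDC) for paired test and retest scores.

    Assumed model: two-way, absolute agreement, single measurement.
    """
    n, k = len(scores_t0), 2  # n subjects, k = 2 occasions
    pairs = list(zip(scores_t0, scores_t1))
    grand = (sum(scores_t0) + sum(scores_t1)) / (n * k)

    # Sums of squares of the two-way analysis of variance.
    ss_subjects = k * sum(((a + b) / k - grand) ** 2 for a, b in pairs)
    ss_occasions = n * sum(
        (sum(col) / n - grand) ** 2 for col in (scores_t0, scores_t1)
    )
    ss_total = sum((x - grand) ** 2 for pair in pairs for x in pair)
    ss_error = max(ss_total - ss_subjects - ss_occasions, 0.0)

    ms_subjects = ss_subjects / (n - 1)
    ms_occasions = ss_occasions / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    icc = (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_occasions - ms_error) / n
    )
    # SEM_agreement includes the systematic difference between occasions.
    var_occasions = max((ms_occasions - ms_error) / n, 0.0)
    sem = sqrt(var_occasions + ms_error)
    sdc = 1.96 * sqrt(2) * sem  # smallest detectable change, individual level
    return icc, sem, sdc
```

For example, if every retest score is exactly 2 points higher than the baseline score, the residual variance is zero, SEM agreement equals √2 and the SDC equals 1.96 × √2 × √2 = 3.92.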

The degree to which the scores of a measurement instrument are consistent with hypotheses based on the assumption that the measurement instrument validly measures the construct to be measured [20] was assessed by testing eight hypotheses expressed in terms of the expected direction and magnitude of the effect (Table 2). Correlations were calculated between the PFDI-20 and PFIQ-7 scores and the SF36 at baseline [15]. Both convergent and divergent validity were tested [18], with the expectation that correlations between related constructs would be high, while those between unrelated constructs would be low or non-existent [18]. Coefficients were arbitrarily considered low (<0.30), moderate (0.30 – 0.59) or high (≥0.60).
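The classification of coefficients and the tallying of hypotheses can be sketched as follows. The cut-offs are those given above. Applying them to the absolute value of the coefficient, so that negative correlations are graded by size, is an assumption, as are the function names.

```python
def correlation_strength(r):
    """Grade a correlation coefficient as low, moderate or high."""
    size = abs(r)  # assumption: the sign is checked separately per hypothesis
    if size < 0.30:
        return "low"
    if size < 0.60:
        return "moderate"
    return "high"


def share_confirmed(outcomes):
    """Proportion of predefined hypotheses confirmed (True = confirmed)."""
    return sum(outcomes) / len(outcomes)
```

With these cut-offs a coefficient of 0.58 is graded as moderate, and seven confirmed hypotheses out of eight give a proportion of 0.875.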

Table 2 Confirmation or rejection of baseline hypotheses

Responsiveness to change (the ability to detect change over time in the construct being measured [20]) of the PFDI-20 and PFIQ-7 was assessed by addressing five hypotheses (Table 3), tested by correlating changes in PFDI-20 and PFIQ-7 scores with changes in SF36 scores [18]. Each questionnaire was considered responsive if at least 75 % of the relevant hypotheses were supported [15]. It was expected that correlations among related constructs would be higher than among unrelated constructs [18]. Compared with the PFDI-20 and PFIQ-7, the SF36 should be relatively unresponsive to change in women undergoing POP surgery [5]. Further, receiver operating characteristic (ROC) curves were constructed and the areas under the curves (AUC) calculated [18]. Changes in scores between T0 and T2 were calculated. After surgery, patients who reported being ‘much improved’ or ‘greatly improved’ in their responses to the GRC [22, 23] were classified as ‘improved significantly’ while those who reported ‘little improvement’ or ‘no change’ were classified as ‘no significant improvement’ [18] (Table 4). Women who reported deterioration in the GRC were excluded from the responsiveness analyses. The PFDI-20 and PFIQ-7 were considered to be responsive to change if AUCs exceeded 0.70 [18].
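The AUC used here can be illustrated without a statistics package, because it equals the probability that a randomly chosen ‘improved significantly’ patient has a larger change score than a randomly chosen ‘no significant improvement’ patient, with ties counted as one half. The sketch below assumes that change scores are expressed as T0 minus T2, so that positive values mean less distress after surgery. The function name and the example data are hypothetical.

```python
def roc_auc(improved, not_improved):
    """AUC from change scores of the two GRC groups (pairwise comparison)."""
    wins = 0.0
    for a in improved:
        for b in not_improved:
            if a > b:
                wins += 1.0  # improved patient correctly ranked higher
            elif a == b:
                wins += 0.5  # ties count as half
    return wins / (len(improved) * len(not_improved))


# Hypothetical change scores (T0 minus T2), for illustration only.
example_auc = roc_auc([60, 50, 70], [10, 55, 20])
is_responsive = example_auc > 0.70  # criterion stated in the text
```

In this made-up example eight of the nine possible pairs are ranked correctly, giving an AUC of 8/9.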

Table 3 Confirmation or rejection of responsiveness hypotheses
Table 4 Responsiveness and interpretability of the PFDI-20 and PFIQ-7 in terms of the changes in total scores from T0 to T2 in 76 women completing the 6-month follow-up (T2)

The MIC, a measure of the interpretability of the change in score, was also calculated [18]. It was determined by the anchor-based MIC distribution, using the ROC approach [18]. The optimal ROC cut-off point was taken as the value for which the sum of the proportions of misclassification, i.e. (1 − sensitivity) + (1 − specificity), was smallest [9]. The MIC must be larger than the SDC for a change in score to be distinguishable from measurement error. Interpretation of change scores was tested using the anchor-based MIC distribution method to assess which changes in PFDI-20 and PFIQ-7 total scores correspond to the MIC defined on the anchor (i.e. GRC), which distinguished patients who had ‘improved significantly’ after surgery from those who showed ‘no significant improvement’ [18].
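The cut-off rule described above can be sketched in Python as a search over the observed change scores. The sketch assumes that change scores are expressed so that larger values mean more improvement and that a patient is classified as improved when the change is at or above the cut-off. The function name and the example data are hypothetical, and ties between candidate cut-offs are resolved in favour of the smallest one.

```python
def mic_from_roc(improved, not_improved):
    """Return (cut-off, misclassification sum) minimising
    (1 - sensitivity) + (1 - specificity) over observed change scores."""
    best_cut, best_error = None, float("inf")
    for cut in sorted(set(improved) | set(not_improved)):
        sensitivity = sum(x >= cut for x in improved) / len(improved)
        specificity = sum(x < cut for x in not_improved) / len(not_improved)
        error = (1 - sensitivity) + (1 - specificity)
        if error < best_error:  # strict: keeps the smallest cut-off on ties
            best_cut, best_error = cut, error
    return best_cut, best_error


# Hypothetical change scores for the two GRC groups, for illustration only.
cut_off, error = mic_from_roc([50, 60, 70, 48], [10, 20, 55, 30])
```

In this made-up example the cut-off of 48 classifies all improved patients correctly and misclassifies one of the four patients without significant improvement, so the misclassification sum is 0.25.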

Results

During the study period 716 consecutive patients were referred to the outpatient clinic for POP. Of these, 424 (58 %) did not meet the inclusion criteria or declined to participate. A further 80 (13 %) were not invited to participate for logistical reasons (Fig. 1), leaving 212 eligible women (29 %) who consented to participate. Of these, 205 completed the questionnaires at T0 giving an excellent response rate of 96.7 %. A subsample of 56 women (27.3 %) completed questionnaires at T1. Of the 96 women undergoing surgery, 76 (79.1 %) completed the questionnaires at T2. The retest evaluation (T1) was completed a median of 11 days (range 6 – 21 days) after T0. At T1 six patients indicated a change in the symptoms and severity of their POP and were not considered further in the study (Fig. 1). The T2 evaluation was completed a median of 184 days (range 153 – 189 days) after T0.

The median age of the women was 61 years (range 27 – 82 years). The majority of women with POP had POP-Q stage 2 or 3. Anterior compartment prolapse was the most common type of POP. Several women had POP in more than one compartment. Women who were treated surgically underwent only vaginal repair. Anterior and posterior compartment repair were the most common procedures (Table 1). Of the 205 women completing the PFDI-20, 172 (83.9 %) reported symptoms in all three PFD domains, 27 (13.2 %) reported symptoms in two PFD domains, and 6 (2.9 %) reported symptoms in only one domain. All 205 women reported symptoms of POP, 192 women (94 %) reported lower urinary tract symptoms and 184 women (88 %) reported bowel symptoms.

Evaluation of measurement properties

No floor or ceiling effects were found in the distributions of the PFDI-20 and PFIQ-7 total scores (Table 5). Similarly, no ceiling effect was observed for any of the PFDI-20 or PFIQ-7 subscales. However, the UIQ-7 subscale showed a small floor effect, while major floor effects were noted for the POPIQ-7 and CRAIQ-7 subscales.

Table 5 Floor and ceiling effects of baseline scores

Missing data at baseline were associated with only 0.82 % of PFDI-20 items and 1.92 % of PFIQ-7 items. Cronbach’s alpha for the PFDI-20 and PFIQ-7 total scores was 0.83 and 0.93, respectively, demonstrating very satisfactory internal consistency. Similarly, subscale coefficients (Table 6) were generally satisfactory to excellent, with the exception of POPDI-6 (0.66). In all cases, for both scales, test–retest ICCs (Table 6) indicated adequate reliability (p < 0.001 for all coefficients). The SDC at the individual level was 16.7 (16.7 %) to 26.3 (26.3 %) for the PFDI-20 subscales (range 0 – 100), and was 46.1 for the PFDI-20 total score (range 0 – 300), i.e. a relative SDC of 15.3 % of the total score. For the PFIQ-7, the SDCs were slightly larger. The SDC was 26.1 (26.1 %) to 27.2 (27.2 %) for the PFIQ-7 subscales (range 0 – 100), and was 62.1 for the PFIQ-7 total score (range 0 – 300), i.e. a relative SDC of 20.7 % of the total score (Table 6).

Table 6 Internal consistency and test–retest statistics

Construct validity was adequate, with 88 % of predefined hypotheses (seven of eight) confirmed (Table 2). The exception was the association between the POPDI-6 and the POPIQ-7, which showed only a moderate positive correlation (0.58). In all other cases, as hypothesized, measures of the same construct showed high positive correlations. Further, scales measuring similar, but not equivalent, constructs showed moderate correlations, and scales measuring unrelated constructs showed low correlations (Table 2).

Responsiveness was adequate, with 100 % of the predefined hypotheses (five of five) confirmed (Table 3). Changes in scores measuring the same construct showed high positive correlations, those measuring similar but not equivalent constructs showed moderate negative correlations, and those measuring unrelated constructs showed low correlations. The responsiveness of the PFDI-20 was further supported by AUC values of ≥0.70, whereas the AUCs were lower for changes in PFIQ-7 scores (Table 4). The MIC for the PFDI-20 total score (0 – 300) was 48, which was slightly larger than the SDC (46.01; Table 6). This suggests that an improvement in PFDI-20 score of ≥48 can be regarded as a clinically relevant change. Patients who had ‘improved significantly’ on the GRC 6 months after surgery achieved a mean change of 63, indicating clinically relevant improvement. The absolute value of the MIC for the PFIQ-7 total score (0 – 300) was 47, which was smaller than the SDC (62.1; Table 6). Hence, a change in PFIQ-7 score at the MIC of 47 points cannot be reliably detected in an individual patient. While such a change may be considered important by the patient, it cannot be distinguished from measurement error.
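The comparison of MIC and SDC reported above amounts to a simple check, sketched here in Python with the values quoted in this paragraph. The function and variable names are hypothetical.

```python
# Values quoted in the text for the total scores (range 0-300).
REPORTED = {
    "PFDI-20": {"mic": 48.0, "sdc": 46.01},
    "PFIQ-7": {"mic": 47.0, "sdc": 62.1},
}


def mic_exceeds_sdc(mic, sdc):
    """True if a change of MIC size is larger than the measurement error."""
    return mic > sdc


def interpretability(values):
    """Apply the MIC > SDC criterion to each instrument."""
    return {
        name: mic_exceeds_sdc(v["mic"], v["sdc"]) for name, v in values.items()
    }
```

Applied to the reported values, the criterion is met for the PFDI-20 (48 > 46.01) and not met for the PFIQ-7 (47 < 62.1).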

Discussion

Norwegian translations of the PFDI-20 and PFIQ-7 were found to have adequate reliability (test–retest reliability and internal consistency), validity and responsiveness to change in a homogeneous sample of women at baseline and after surgical treatment. As predicted [5], all retest assessments of the PFDI-20 and PFIQ-7 showed adequate reliability. In general, internal consistency was at least adequate, with the exception of the POPDI-6, for which internal consistency was found to be less than adequate (0.66). Interestingly, some other cross-culturally adapted versions have shown a similar issue for the POPDI-6 [7, 9].

As in Swedish and Dutch studies [7, 9], no ceiling effects were found for total or subscale scores of these measures. However, as floor effects were found in the POPIQ-7 and CRAIQ-7 subscales of the PFIQ-7, it is suggested that the PFIQ-7 should be interpreted in terms of both the total score and the subscale scores. This supports the findings of the Dutch study, which found similar floor effects [9]. The authors pointed out that patients can experience various types of PFD, but might not experience all associated symptoms (e.g. POP and defecation problems without urinary incontinence) [9].

Responsiveness was high for PFDI-20 and moderate for PFIQ-7. Thus, the PFDI-20 exhibited a better ability to capture change. For the ROC curve analysis, the patients were divided into two groups: ‘no significant improvement’ and ‘improved significantly’. During sensitivity analysis using the ROC method, two patients who reported ‘no change’ were included in the combined ‘no significant improvement’ category. Further, we redefined minimal importance and dichotomized GRC as ‘improved slightly’/‘much improved and improved greatly’ [24]. The dichotomization into the two GRC categories resulted in similar responsiveness for the PFDI-20 and PFIQ-7. Moreover, the results for the PFDI-20 were similar to those in a Danish translation study, which also showed that the instrument has adequate responsiveness to change [24].

GRC might be seen as not measuring the same constructs as the PFDI-20 and the PFIQ-7 scales. However, Gelhorn et al. [22] consider that the PFDI-20, PFIQ-7 and GRC (which they refer to as Patient Global Impression of Change) are sound external measures of patients’ perception of change. The PFDI-20 showed a MIC of 48, which is similar to the minimal clinically important difference of 45 points found by Barber et al. [5]. The PFDI-20 can detect clinically relevant improvement, whereas the measurement error of the PFIQ-7 was too large to detect clinically relevant improvement. The Dutch study found similar results for both the PFDI-20 and PFIQ-7 [9].

Some caveats to the interpretation of the current results should be acknowledged. First, a limitation was the recruitment of only those women with symptomatic POP (with or without urinary or bowel dysfunction). That is, women with only urinary or bowel dysfunction were not recruited. However, both urinary and bowel dysfunction were present with high frequency in the total sample, with only six participants (2.9 %) reporting having exclusively POP. In terms of psychometrics, validation data were collected only within a tertiary setting, which limits generalizability. Further validation studies in more general contexts are therefore recommended. Further recommendations include responsiveness testing for conservative treatment, and establishing confirmatory factor analysis and clinically meaningful interpretations of PFDI-20 and PFIQ-7 total scores and subscales. Educational level was not included in the baseline characteristics and the study was not able to demonstrate if the questionnaires could be understood by women of all educational levels. Moreover, during the pilot test sexuality was an aspect identified as important to patients and not covered in the PFDI-20 and PFIQ-7. Employing a third measuring instrument covering sexuality issues for women with PFD should also be considered [25]. Finally, validation of electronic administration versions of the PFDI-20 and PFIQ-7 is also recommended in clinical and research settings [26]. Electronic administration may encourage higher survey response rates and, hence, reduce nonresponse bias.

Conclusions

The translated and validated Norwegian versions of the PFDI-20 and PFIQ-7 are effective measures of symptom distress and quality of life among Norwegian women with POP and PFD. The PFDI-20 exhibited a better ability to capture changes than the PFIQ-7. The use of these instruments in the clinical and research settings will provide data that could lead to better patient management and policy decisions in Norway.