Introduction

Pelvic floor dysfunction (PFD) is a clinical syndrome characterized by a variety of symptoms relating to pelvic organ prolapse (POP), urinary incontinence (UI), fecal incontinence (FI), and sensory and emptying abnormalities of the lower urinary tract. The first three conditions are the most common, with respective incidence rates of 11.4–9.56%, 30.9% [1], and 1.28% [2] in China. PFD is not fatal but significantly impacts women’s quality of life (QOL).

With increasing attention paid to pelvic floor disorders, a series of validated self-administered QOL questionnaires have been developed to assess individual symptoms [3]. The Pelvic Floor Distress Inventory-20 (PFDI-20) is a condition-specific short form of the Pelvic Floor Distress Inventory (PFDI), which was recommended by the International Continence Society (ICS) as a class A questionnaire and is currently widely used [4]. The questionnaire not only contains all items of the Urinary Distress Inventory (UDI), which was frequently used in assessing UI several years ago, but also includes items regarding POP and anorectal dysfunction. It is divided into three subscales related to POP symptoms, colorectal symptoms, and urinary symptoms. The PFDI-20 has recently been translated and validated in many different countries [5,6,7,8,9,10,11]. In China, the Pelvic Floor Impact Questionnaire-7 (PFIQ-7) was validated in 2011 by Zhu L et al. [12], yet there was no Chinese version of the PFDI-20. Measurement properties of the instrument were evaluated according to the Consensus-Based Standards for the Selection of Health Status Measurement Instruments (COSMIN) checklist [13]. The checklist is based on an international Delphi study in 2010, and it is used to evaluate the methodological quality of studies on health status measurement instruments [14].

The objective of this study was to translate the PFDI-20 into Chinese and to evaluate its psychometric properties in Chinese women with symptomatic PFD according to the COSMIN checklist.

Materials and methods

Questionnaires

In addition to the PFDI-20, the following questionnaires were used in this study:

  • Pelvic Floor Impact Questionnaire

The PFIQ-7, which includes three corresponding subscales (urinary, colorectal–anal, and POP), assesses the impact of the condition on four aspects of patient QOL (physical activity, travel, social/relationships, and emotional health). The Chinese version was validated by Zhu et al. [12] and has high reliability and validity in the Chinese population.

  • Subjective assessment

The Patient Global Impressions of Improvement (PGI-I) questionnaire is a one-item questionnaire that asks patients to rate the perceived change in response to therapy. Patients are asked to check the one number that best describes how their symptoms are now compared with how they were before surgery. Responses are given on a 7-point scale: very much better, much better, a little better, no change, a little worse, much worse, or very much worse. It has been validated in clinical studies of stress urinary incontinence (SUI) [15] and urogenital prolapse [16].

Translation process

To maintain as much of the original meaning as possible, the PFDI-20 translation involved two independent forward and backward translations [17]. First, the English version was translated into Chinese independently by two bilingual translators. These two versions were reviewed by a group of researchers to increase the face validity of the questionnaire. Second, the resultant translation was back-translated into English by two other bilingual experts. Finally, a consensus was established by a panel of bilingual translators and clinical experts. A pretest with 50 eligible PFD patients was performed to determine whether the questionnaire had unclear or vague items and whether its completion time was appropriate. We then synthesized and developed the final version, the translation process of which was modified based on cross-cultural adaptation (see Appendix for the final Chinese version).

Validation study

A cross-sectional study was conducted between October 2017 and May 2018 to evaluate the reliability, validity, and responsiveness of the PFDI-20. Inclusion criteria were a diagnosis of PFD or UI and sufficient reading and comprehension abilities. Patients were excluded if they had chronic inflammation and organ lesions. All patients who completed the PFDI-20 and PFIQ-7 in an outpatient setting were grouped as T1 patients. After 1–2 weeks (T2), they were asked to complete the questionnaire again by telephone if there had been no symptomatic changes and no interventions had been undertaken. In the third round (T3), patients undergoing surgery were asked to complete the PFDI-20 and PGI-I to evaluate responsiveness. The validation process is shown in Fig. 1.

Fig. 1 Verification process of the Pelvic Floor Distress Inventory-20 (PFDI-20)

Statistical analysis

Statistical analysis was performed using the SPSS software package (version 23.0, SPSS Inc., Chicago, IL, USA). Descriptive data are presented as means ± standard deviation (SD) or medians (25th percentile, 75th percentile). The chi-squared test was used for univariate associations, and the Mann–Whitney U test was used for comparisons of independent groups. P<0.05 was considered significant. Psychometric properties were evaluated as recommended in the COSMIN checklist. Methodological testing, including reliability, validity, and responsiveness, was assessed.

Reliability (internal consistency, test–retest reliability)

Reliability, the degree to which a measurement is free from measurement error, concerns the ability to distinguish patients from each other. We calculated internal consistency, test–retest reliability, and measurement error to evaluate reliability. Cronbach’s alpha was calculated for PFDI-20 scores as a measure of internal consistency. A value of ≥0.70 was considered adequate [14]. Test–retest reliability was evaluated with intraclass correlation coefficients (ICCs) to quantify agreement of total and subscale scores. An ICC >0.70 was considered preferable [14].
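As an illustration of the internal consistency measure described above, the following minimal sketch computes Cronbach's alpha from item-level responses using the standard formula alpha = k/(k-1) × (1 − Σ item variances / variance of total scores). It is not the SPSS procedure used in the study; the function name `cronbach_alpha` and the response data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    # Sample variance of each item across respondents
    item_vars = [variance([resp[i] for resp in item_scores]) for i in range(k)]
    # Sample variance of each respondent's summed score
    total_var = variance([sum(resp) for resp in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 respondents x 4 items, each item scored 0-4
responses = [
    [0, 1, 1, 0],
    [2, 2, 3, 2],
    [4, 3, 4, 4],
    [1, 1, 2, 1],
    [3, 3, 3, 4],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}, adequate (>= 0.70): {alpha >= 0.70}")
```

The same function could be applied per subscale by passing only the items belonging to that subscale.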

Measurement error

Measurement error, the systematic and random error of a patient’s score that is not attributed to true changes in the construct, can be expressed as the standard error of measurement (SEm) and the smallest detectable change (SDC). Data from T1 and T2 were used to determine measurement error. We assumed that there would be no real change in a patient’s level of function over a 1- to 2-week interval and that change scores would be normally distributed with a mean close to zero. The SEm represents the SD of repeated measures in one patient and was calculated as the square root of the error variance [13]. The SDC represents the smallest individual change that a patient needs to show on the scale to ensure that the observed change is real. The SDC was calculated using the formula \( \mathrm{SDC} = 1.96 \times \sqrt{2} \times \mathrm{SEm} / \sqrt{n} \) [13].
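The two quantities defined above can be computed directly. The sketch below follows the formula exactly as stated in the text, with n passed in explicitly; the function names and the input values are hypothetical and are not the study data.

```python
import math


def sem_from_error_variance(error_variance):
    """SEm as the square root of the error variance."""
    return math.sqrt(error_variance)


def smallest_detectable_change(sem, n):
    """SDC = 1.96 * sqrt(2) * SEm / sqrt(n), following the formula in the text."""
    return 1.96 * math.sqrt(2) * sem / math.sqrt(n)


# Hypothetical inputs for illustration only
sem = sem_from_error_variance(100.0)       # error variance of 100 gives SEm = 10
sdc = smallest_detectable_change(sem, 25)  # assumed n = 25
print(f"SEm = {sem:.2f}, SDC = {sdc:.2f}")
```

The factor 1.96 corresponds to a 95% confidence level, and the factor sqrt(2) reflects that a change score combines the error of two measurements.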

Validity (face validity, construct validity, criterion validity, and hypothesis testing)

Face/content validity was assessed by an expert panel and/or patient focus group during development of the original questionnaire and the Chinese translation version. To evaluate criterion validity, Spearman’s correlation coefficient was calculated between total/subscale scores of the PFDI-20 and a related criterion standard, the PFIQ-7 questionnaire, which had already been validated in our country for the assessment of PFD [12]. Corrected item-total correlations >0.70 were considered evidence of validity. Factor analysis was used as a tool for estimating construct validity. We hypothesized that patients who had UI and lower urinary tract symptoms (LUTS) would have higher UDI-6 scores than those who did not and that patients who experienced POP would have higher POPDI-6 scores than those who did not. Hypothesis testing was adequate if 75% of these hypotheses were confirmed, and the sample size of each group was required to be >50 [14]. In addition, floor and ceiling effects and percentage of patients obtaining minimum and maximum scores were calculated, and >15% was considered problematic.
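For readers unfamiliar with the correlation used for criterion validity, the following sketch computes Spearman's rank correlation in plain Python: both score lists are converted to ranks (ties share the mean rank), and the Pearson correlation of the ranks is returned. This is an illustrative implementation rather than the SPSS routine used in the study; the function names and the paired scores are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical paired total scores for five patients
pfdi20_scores = [120.8, 45.0, 200.5, 88.3, 150.0]
pfiq7_scores = [90.5, 20.0, 180.0, 95.2, 110.0]
print(f"rho = {spearman_rho(pfdi20_scores, pfiq7_scores):.2f}")
```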

Responsiveness

Responsiveness is the sensitivity of the PFDI-20 to clinically significant changes. A comparison between the SDC and the minimal important change (MIC) was carried out to demonstrate responsiveness. An SDC smaller than the MIC was considered adequate. An anchor-based method was used to calculate the MIC. T3 patients were asked to complete the PGI-I and were grouped according to their answers. Comparisons between the groups were conducted to demonstrate that the instrument can detect clinical changes.

Results

Study population and protocol

Between October 2017 and May 2018, 150 patients were invited to participate in the survey. A total of 126 completed all questionnaires and were included in the data analysis. Mean patient age was 58.7 ± 10.5 years. Thirty-nine (29.8%) had symptoms of UI, one of whom had anal incontinence. Sixty-four participants (48.9%) had LUTS. A total of 89 patients (67.9%) felt vaginal/uterine prolapse when performing physical work or in the resting state. Among all participants, POP-Q stage III was the most prevalent finding (57.3%). Demographic data and score results are shown in Table 1. Seventy-five patients were selected randomly for retest analysis to complete the questionnaires again 1–2 weeks later (this interval is considered short enough to avoid changes in presenting symptoms and long enough for patients to forget their previous responses), while 24 of them were excluded for undergoing conservative treatment, such as Kegel exercises or drug therapy. Of the original respondents, 80 patients underwent PFD surgery, including procedures such as vaginal hysterectomy, anterior/posterior colporrhaphy, laparoscopic sacrocolpopexy, colpocleisis, Total Prolift System surgery, and tension-free vaginal tapes; 74 completed a follow-up visit after 3 months (response rate = 92.5%). Questionnaires containing missing items or unclear individual information were excluded. Baseline information of the six patients who withdrew at T3 was secondarily analyzed, and there was no significant difference between them and the original respondents. All missing data occurred randomly.

Table 1 Characteristics of the 126 participants

Reliability

Cronbach’s alpha indicated adequate internal consistency for the PFDI-20 (α = 0.88), POPDI-6 (α = 0.77), UDI-6 (α = 0.80), and CRADI-8 (α = 0.84) (Table 2). Moreover, alpha did not increase when any single item was deleted. In the test–retest analysis, the instrument showed good reliability. The total PFDI-20 showed an ICC of 0.997, and ICCs ranging from 0.994 to 0.997 were found for its subscales (Table 2). The SEm was 49.1, and the SDC indicating the smallest individual change was 18.36.

Table 2 Internal consistency and reproducibility of the PFDI-20

Validity

Content validity

The floor/ceiling effect is also an important component of content validity. Three of the 126 patients (2.4%) scored the maximum score of 300, which is well below the 15% threshold and rejected the presence of a ceiling effect for the PFDI-20. There was no relevant floor effect, because no patients scored zero (0.0%).
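The floor and ceiling check can be sketched as follows. By the usual definitions, the floor effect is the percentage of patients at the lowest possible score and the ceiling effect is the percentage at the highest possible score; each is flagged when it exceeds the 15% threshold given in the Methods. The function name and the score list are hypothetical; only the 0 to 300 score range and the count of three maximum scores are taken from the text.

```python
def floor_ceiling_effects(scores, min_score=0, max_score=300, threshold=15.0):
    """Percentage of patients at the minimum and maximum possible scores."""
    n = len(scores)
    floor_pct = 100 * sum(s == min_score for s in scores) / n
    ceiling_pct = 100 * sum(s == max_score for s in scores) / n
    return {
        "floor_pct": floor_pct,
        "ceiling_pct": ceiling_pct,
        "floor_problem": floor_pct > threshold,
        "ceiling_problem": ceiling_pct > threshold,
    }


# Hypothetical score list: 3 patients at the maximum, none at zero,
# and 123 placeholder mid-range scores (not the study data)
scores = [300] * 3 + [100] * 123
result = floor_ceiling_effects(scores)
print(f"floor = {result['floor_pct']:.1f}%, ceiling = {result['ceiling_pct']:.1f}%")
```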

Criterion validity

The assessment of criterion validity was analyzed by the correlation between scores on the PFDI-20 and the PFIQ-7 (Table 3). Spearman’s correlation coefficient between the two questionnaires was 0.87 and ranged from 0.56 to 0.81 on the subscales, demonstrating good criterion validity for PFDI-20. Meanwhile, in this instrument, the total score correlated well with its respective subscales.

Table 3 Spearman’s correlation coefficients between PFDI-20 and PFIQ-7

Construct validity

Confirmatory factor analysis (CFA) with Varimax rotation was employed to assess construct validity. As shown in Table 4, the Kaiser-Meyer-Olkin (KMO) measure and Bartlett test indicated that the data were suitable for factor analysis (KMO measure 0.821). CFA yielded five factors (cutoff point eigenvalue >1.0) that cumulatively explained 69.55% of the variance, which indicates a good factor extraction. As shown in Table 5, the cumulative variance explained as each of the five factors was added was 22.12%, 40.22%, 53.67%, 62.84%, and 69.55%. Table 5 gives the factor loadings of the Varimax-rotated five-factor solution. Questions 7–13 had high factor loadings on the first factor, which could be explained as colorectal–anal distress. Questions 1–3, 5, 15, 19, and 20 loaded highly on the second factor, which could be classified as direct feelings of organ prolapse and lower urinary tract obstruction or irritation symptoms. Questions 16–18 had high loadings on the third factor and could be classified as various types of UI. Questions 4 and 6 belonged to a fourth factor: excretion with external force. Only question 14 independently belonged to the fifth factor: rectocele symptoms. All factor contributions of the variance ranged from 45 to 90%. Although this structure is not exactly the same as that of the original version, its logical structure indicated that this instrument has good construct validity.
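The five percentages reported above end at the stated total of 69.55%, so they read as cumulative values. Under that reading, the share of variance attributable to each individual factor is the difference between successive values, as the short sketch below shows. The function name is hypothetical.

```python
def per_factor_variance(cumulative):
    """Convert cumulative explained-variance percentages to per-factor shares."""
    previous = [0.0] + cumulative[:-1]
    return [round(c - p, 2) for p, c in zip(previous, cumulative)]


# Cumulative percentages as reported in the text
cumulative_pct = [22.12, 40.22, 53.67, 62.84, 69.55]
shares = per_factor_variance(cumulative_pct)
for factor, share in enumerate(shares, start=1):
    print(f"Factor {factor}: {share:.2f}%")
```

On this reading, the first two factors account for more than half of the explained variance.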

Table 4 Kaiser-Meyer-Olkin (KMO) and Bartlett test of confirmatory factor analysis
Table 5 Results of factor analysis within the five PFDI-20 dimensions

Hypothesis testing

We hypothesized that patients with POP would have higher POPDI-6 scores than those without these symptoms and that patients with UI or LUTS would have higher UDI-6 scores than those without those symptoms. Hypothesis testing was adequate if 75% of the hypotheses were confirmed. All predefined hypotheses were confirmed, as shown in Fig. 2. Validity analysis of the CRADI subscale was not included in the hypothesis testing because of the low prevalence of FI, which led to an inadequate number of patients.

Fig. 2 Hypothesis testing. Pelvic Floor Distress Inventory (PFDI)-20 scores with comparisons between groups. The bold lines represent the median, the boxes represent the interquartile range, and the whiskers represent the minimum and maximum scores

Responsiveness and interpretability

In the third round of investigation, 80 participants received surgical treatment, 74 of whom completed the questionnaires (response rate = 92.5%) and were grouped by PGI-I scores. As shown in Fig. 3, there was a significant difference between preoperative and postoperative scores in the very much better (p < 0.05) and much better groups (p < 0.05), and there was no statistically significant difference in the no-change group (p = 0.10). The group responding a little better was excluded because of its small sample size. In addition, the score difference was more pronounced in the very much better group than in the much better group, indicating that the instrument can translate a qualitative effect into a quantitative one.

Fig. 3 Responsiveness and interpretability. Comparison of Pelvic Floor Distress Inventory (PFDI)-20 scores before and after operation in each Patient Global Impression of Improvement (PGI-I) group. The bold lines represent the median, the boxes represent the interquartile range, and the whiskers represent the minimum and maximum scores

There was no gold standard for calculating the MIC according to the COSMIN checklist. The MIC was estimated with an anchor-based method. The 95% confidence interval (CI) of the effect size (ES) of the much better and no change groups was calculated; the cutoff point should be outside the 95% CI of the no change group and also be the smallest for the much better group. Therefore, the estimated ES in our population was 1.86 (Fig. 4), and the MIC of the PFDI-20 was 50.0. The Chinese version of the PFDI-20 showed a lower SDC (18.36) than the MIC (50.0), and the responsiveness was therefore adequate based on Terwee et al. [14].

Fig. 4 Estimation of effect size by indication of the 95% confidence interval (CI) for no change and much better groups

Discussion

The purpose of this study was to translate the PFDI-20 into Chinese and validate it in Chinese women. The psychometric properties evaluated included reliability, validity, responsiveness, and interpretability. A Cronbach’s alpha of 0.875 showed satisfactory internal consistency of the PFDI-20. Similar values were reported in Japan, Brazil, and African countries. The second round of investigation was conducted by phone and used to assess test–retest reliability. The PFDI-20 showed excellent test–retest reliability, with an observed ICC of 0.997.

Because there is no gold standard for PFD symptoms, we calculated correlations between the PFDI-20 and the validated PFIQ-7 to estimate criterion validity. A Spearman correlation coefficient of 0.867 indicated adequate criterion validity, and there were significant correlations between subscale scores on both instruments, although the correlation was weaker for the FI subscales (r = 0.559). Moreover, adequate correlations of questionnaires with similar structures of the UI, FI, and POP subscales further verified construct validity, similar to some validation studies, such as those conducted in Holland, Brazil, and some African countries [7, 8, 18]. In the study reported here, factor analysis was further employed to evaluate construct validity. The structure of the PFDI-20 is similar to that of the original version, the PFDI, which includes 46 items and three subscales (UDI, POPDI, CRADI). Among them, the UDI retains all three original structures (obstruction, irritation, and stress) described by Shumaker et al. [19] and was expanded using nine items related to LUTS, which are common in PFD patients. The POPDI consists of 16 items divided into three parts (overall, anterior, and posterior compartment); the CRADI consists of 17 items associated with lower gastrointestinal disorders, which are divided into four parts (obstruction, incontinence, pain/irritation, and rectal prolapse). The CFA eventually explained 69.55% of the item variance, and the five factors were found to largely match the logical dimensions, supporting the construct validity of the PFDI-20.

Evaluation of treatment efficiency is considered extremely important and, as the basis of modern evidence-based medicine, shows strong potential in clinical work. The SDC and the MIC are the basis for the interpretability of patient-reported outcome measures (PROMs) and help determine whether observed changes benefit patients. To determine the clinical significance of score changes at the individual level, measurement error needs to be assessed and should not be greater than the MIC. Otherwise, the observed change cannot be determined to be a real change, as the risk of measurement error is >5%. One method of reducing the SDC is to decrease the measurement error by averaging the values of repeated measurements, which places an additional burden on patients and increases the chance of recall bias. Improving the quality of questionnaires by adding or improving items may be an alternative approach.

There is currently no consensus on choosing an anchor point for calculating the MIC. The slightly improved group is generally considered to reflect the smallest important clinical change most effectively. Some studies have used a 15-point scale and taken the average change in the almost the same, slightly better, or slightly worse groups to represent the MIC. We used the much better group of the 7-point PGI-I scale in this responsiveness analysis because only one patient was in the slightly improved group. Finally, adequate responsiveness was demonstrated by a higher MIC value (50.0) compared with the SDC (18.36), indicating that when a patient’s score change is greater than the MIC, we can be 95% confident that the change is not caused by measurement error. Grouped comparisons by PGI-I response were also performed to assess preoperative and postoperative score changes. Results showed a statistically significant decrease in scores in the very much better and much better groups, especially in the very much better group. No statistically significant difference before and after operation was observed in the no change group.

There were some limitations in this study. First, although no studies have shown differences in MIC between distinct surgical or nonsurgical groups, some authors believe that analysis and assessment between different groups is needed [20]. In our study, only patients undergoing surgical interventions were analyzed; in future research, T3 patients should be divided into groups by intervention method. Second, although the anchor-based technique is considered to be the best method for evaluating the MIC, its effectiveness and the best calculation methods remain controversial [21]. Furthermore, due to the small sample size of the slightly improved group, we chose the much better group as the second-best basis for calculating the MIC, which means that the estimate was inevitably greater than the true value. Finally, sample sizes of the UI and POP groups in the hypothesis testing were <50, meaning that those groups did not strictly meet the sample size requirement. There were not enough patients with FI to evaluate the psychometric properties of the CRADI-8 subscale, meaning that the validity of that subscale could not be satisfactorily established. Future research should focus on wider use of the PGI-I and enlargement of the sample size.

To conclude, the Chinese version of the PFDI-20 is a reliable and valid instrument and can contribute considerably to the assessment and care of Chinese patients with PFD.