Introduction

The importance of patient-reported outcome measures (PROMs), in addition to objective clinical measures, has been recognised for assessing the outcome of surgery for urinary incontinence (UI) and pelvic organ prolapse (POP) [1–4]. PROMs are used to evaluate the effectiveness and quality of treatment in routine practice, and to support quality improvement, benchmarking, and decision-making [4, 5]. However, critics argue that PROMs do not provide unambiguous answers about whether an intervention succeeds, and hence about whether they can serve as an evaluation tool for clinical interventions [6–8]. Different types of PROMs have been used in questionnaires following surgery for UI and POP. The International Consultation on Incontinence (ICI) initiated the development and evaluation of disease-specific questionnaires (ICIQ) [1, 9–12]. This scoring system is based on visual analogue scales (VAS) and the scoring of patients’ subjective symptoms in defined categories (Likert scales), and the ICIQ score is the sum of a combination of these (Fig. 1). When patients are asked both before and after an intervention, PROMs can be used to compare the patient’s degree of satisfaction and improvement following the intervention. Alternatively, simpler PROMs with an inherent before–after assessment have been suggested, such as the Patient’s Global Impression of Improvement (PGI-I) score. Such scoring systems have been validated for the evaluation of UI and POP [10, 11, 13, 14], and are widely accepted in the recent literature. To the best of our knowledge, no study has evaluated the scoring systems in a large body of data, and there is concern about possible recall bias for the PGI-I score compared with the traditional pre- and postoperative questionnaires.

Fig. 1 Pre- and post-questionnaires. The International Consultation on Incontinence Questionnaire (ICIQ) score is based on the sum of questions A–C and the PGI-I score is question D [1, 12]

The Danish Urogynaecological Database (DugaBase) was established to monitor, ensure and improve the quality of urogynaecological surgery for all UI and POP surgeries in public and private hospitals in Denmark [15, 16]. Since its establishment in 2006, pre- and postoperative questionnaires regarding POP and UI surgery have been systematically collected. In 2013 the DugaBase was supplemented with a postoperative PGI-I score. The DugaBase is a national register with a high completeness, 95.0 % for UI and 91.3 % for POP surgeries [15].

The aim of this study was to assess the concordance between patients’ evaluation of surgery using the single postoperative PGI-I score and their evaluation using a pre- and postoperative ICIQ scoring system, for women undergoing UI or POP surgery.

Materials and methods

This study is based on pre- and postoperative questionnaires completed by women aged 18 years or older undergoing surgery for UI or POP in Denmark in 2013. Definitions conform to the international joint report on terminology for female pelvic floor dysfunction and urinary incontinence [17]. Only those who completed both the pre- and postoperative questionnaires were included in the analyses.

Data sources

Danish law requires all hospital departments and private hospitals/clinics performing POP and UI surgery to report data to the DugaBase. Data are collected through a national web-based input module.

The DugaBase contains information on five areas: referrals; a pre-operative self-administered patient questionnaire based on the ICIQ scoring system; a pre-operative questionnaire completed by the gynaecologists including information on preoperative examination; information on surgical procedures; and finally, a post-surgery questionnaire consisting of the same self-administered questionnaires as those used before surgery, supplemented with a PGI-I score.

The ICIQ scoring system is based on visual analogue scales (VAS) from 0 to 10 and two symptom-specific Likert scales, and the ICIQ score is the sum of a combination of these scores (Fig. 1). The degree of improvement is taken as the difference between the pre- and post-surgery ICIQ scores. The pre-surgery questionnaire is completed in connection with the preoperative examination. The post-surgery questionnaire is either sent to the patient 3 months after surgery or filled in by a nurse conducting a telephone interview with the patient. Question scores relevant for this study are presented in Fig. 1, and a detailed description of the database is available elsewhere [16].

Statistical analysis

All results are reported using descriptive statistics as numbers and means with 95 % confidence intervals. We computed the ceiling effect, defined as the percentage of respondents who achieved the highest possible score, and we set a cut-off point of 15 % as an acceptable ceiling for an operational scale [18]. A high ceiling effect indicates that responses cluster at the top end of the scale, so that further improvement cannot be recognised.
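The ceiling effect defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' STATA code; the function names `ceiling_effect` and `acceptable_ceiling` are hypothetical, and treating a value of exactly 15 % as not acceptable is an assumption.

```python
def ceiling_effect(scores, max_score):
    """Percentage of respondents who achieved the highest possible score."""
    if not scores:
        raise ValueError("scores must not be empty")
    at_top = sum(1 for s in scores if s == max_score)
    return 100.0 * at_top / len(scores)


def acceptable_ceiling(pct, cutoff=15.0):
    """True if the ceiling is under the cut-off (15 % in this study).

    Assumption: a ceiling of exactly 15 % counts as not acceptable.
    """
    return pct < cutoff


if __name__ == "__main__":
    # Toy data: two of four respondents score the maximum of 10.
    print(ceiling_effect([10, 10, 7, 3], max_score=10))
    # Ceiling values reported for UI in this study (ICIQ, PGI-I).
    print(acceptable_ceiling(3.8), acceptable_ceiling(69.9))
```

Applied to the ceilings reported in this study, the check passes for the ICIQ score (3.8 % for UI, 14.1 % for POP) and fails for the PGI-I score (69.9 % and 53.2 %).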

To estimate the agreement between two methods of measurement (ICIQ and PGI-I) that assess the same clinical intervention, a traditional correlation analysis was not appropriate. We therefore analysed the agreement of the PGI-I and ICIQ scores by converting them to a comparable scale ranging from −1 to 1, where 1 is the highest possible improvement and −1 is the lowest, and we analysed the agreement both as categorical variables and as if they were continuous variables. For the categorical analysis, we calculated the inter-rater strength of agreement between the ICIQ and the PGI-I score by weighted Kappa statistics, using Altman’s definitions: poor (kappa value <0.21), fair (0.21–0.40), moderate (0.41–0.60), good (0.61–0.80), and very good (0.81–1.00) [19]. For the continuous analysis, we used the 95 % limits of agreement method, which is presented graphically as the Bland–Altman plot [19, 20]. All calculations were performed using STATA Release 13.0.
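The two agreement analyses can be illustrated with a short Python sketch. It is not the STATA code used in the study, and several details are assumptions because the text does not specify them: the rescaling formulas in `iciq_improvement` and `pgi_improvement`, a PGI-I with seven ordered answer levels, linear kappa weights, and all function names.

```python
import math


def iciq_improvement(pre, post, max_score):
    """Assumed rescaling: score change divided by the scale maximum, in [-1, 1]."""
    return (pre - post) / max_score


def pgi_improvement(answer, n_levels=7):
    """Assumed linear map: answer 1 (best) -> 1.0, answer n_levels (worst) -> -1.0."""
    return 1.0 - 2.0 * (answer - 1) / (n_levels - 1)


def weighted_kappa(a, b, n_cat):
    """Weighted kappa for paired ratings coded 0..n_cat-1 (linear weights assumed).

    Undefined (division by zero) if expected weighted agreement equals 1.
    """
    n = len(a)
    obs = [[0.0] * n_cat for _ in range(n_cat)]
    for i, j in zip(a, b):
        obs[i][j] += 1.0 / n  # observed proportion in each cell
    row = [sum(r) for r in obs]
    col = [sum(obs[i][j] for i in range(n_cat)) for j in range(n_cat)]

    def w(i, j):
        return 1.0 - abs(i - j) / (n_cat - 1)

    cells = [(i, j) for i in range(n_cat) for j in range(n_cat)]
    p_obs = sum(w(i, j) * obs[i][j] for i, j in cells)
    p_exp = sum(w(i, j) * row[i] * col[j] for i, j in cells)
    return (p_obs - p_exp) / (1.0 - p_exp)


def altman_label(kappa):
    """Altman's strength-of-agreement categories as listed in the text."""
    if kappa < 0.21:
        return "poor"
    if kappa < 0.41:
        return "fair"
    if kappa < 0.61:
        return "moderate"
    if kappa < 0.81:
        return "good"
    return "very good"


def limits_of_agreement(x, y):
    """Return (lower, mean, upper): mean difference +/- 1.96 sample SDs."""
    d = [xi - yi for xi, yi in zip(x, y)]
    mean = sum(d) / len(d)
    sd = math.sqrt(sum((di - mean) ** 2 for di in d) / (len(d) - 1))
    return mean - 1.96 * sd, mean, mean + 1.96 * sd


if __name__ == "__main__":
    # Toy data only; not taken from the DugaBase.
    pgi = [pgi_improvement(x) for x in (1, 1, 2, 4)]
    iciq = [iciq_improvement(pre, post, 20)
            for pre, post in ((18, 2), (15, 6), (12, 9), (10, 10))]
    print(limits_of_agreement(pgi, iciq))
```

Keeping the rescaling separate from the agreement statistics means that a different conversion to the −1 to 1 scale can be substituted without touching the kappa or limits-of-agreement code.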

Approvals

The DugaBase operates under the Danish law on data protection, with a license granted by the Danish Data Protection Agency and the Danish Health and Medicines Authority. This specific study has been approved by the Danish Data Protection Agency (Region Syddanmark: 2008-58-0035/sagnr. 14/15130). According to Danish law, ethical approval is not required for purely registry-based studies.

Results

Of the 5,476 women registered in the DugaBase in 2013, 3,310 (60.4 %) were included in this study, 738 after surgery for UI and 2,581 after surgery for POP (9 women underwent POP and UI surgery concomitantly). Among the 2,166 excluded, 525 had not filled in the preoperative questionnaire, 1,141 the postoperative questionnaire, and 499 neither of them.

Overall, the PGI-I score showed higher improvement than the ICIQ score on a converted comparable scale: PGI-I 0.83 (95 % confidence interval [CI]: 0.80–0.85) vs ICIQ 0.62 (CI 0.60–0.64) for UI, and 0.77 (CI 0.75–0.78) vs 0.66 (CI 0.65–0.67) for POP (Table 1). Among the subgroups, the elderly (>70 years) had a lower degree of improvement following UI surgery than other age groups on both scores, and the youngest (18–39 years) had the lowest degree of improvement after POP surgery on the ICIQ score (Table 1). The subgroup of women with previous UI surgery had a lower degree of improvement following UI surgery, which is consistent with other studies of repeat UI surgery [21, 22].

Table 1 Pre- and post-surgery improvement at Patient’s Global Impression of Improvement (PGI-I) and International Consultation on Incontinence Questionnaire (ICIQ) scores (scores converted to comparable −1 to 1 scales)

The scatterplot of PGI-I versus ICIQ illustrates the relatively higher score for PGI-I (Fig. 2). Moreover, only a few women reported a negative improvement after surgery, regardless of score. The dotted line in Fig. 2 illustrates where the two scores were identical. There was a higher concordance for UI than for POP, although the regression line did not coincide with the line of equality for either UI or POP. Figure 2 also illustrates the ceiling, which is especially high for PGI-I. For UI, the ceiling was 3.8 % for ICIQ and 69.9 % for PGI-I, whereas for POP it was 14.1 % and 53.2 %, respectively (Table 2).

Fig. 2 PGI-I in relation to ICIQ score (scores not converted to a −1 to 1 scale) following surgery for urinary incontinence and pelvic organ prolapse (unbroken line: regression line; dotted line: line of equality)

Table 2 Ceiling for PGI-I and ICIQ scores [18]

Using Kappa statistics, the agreement between the PGI-I and the ICIQ score was fair for both POP and UI surgery, interpreted according to Altman’s definition of strength of agreement (Table 3). We computed a Bland–Altman plot to show the differences between ICIQ and PGI-I against the mean of the same ICIQ and PGI-I scores (Fig. 3). The plot shows 95 % limits of agreement from −0.29 to 0.71 for UI and from −0.57 to 0.79 for POP. The histograms showed that the mean difference between the scores was 0.21 for UI and 0.11 for POP, on a comparable −1 to 1 scale. They further showed an approximately normally distributed difference, thus fulfilling the assumption for the 95 % limits of agreement method.
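As a consistency check on these figures, and assuming the limits were computed in the usual way as the mean difference plus or minus 1.96 standard deviations, the mean difference is the midpoint of the limits and the standard deviation follows from their width. The symbols \bar{d} and s_d are introduced here for illustration, and the standard deviations are implied values, not figures reported by the study.

```latex
\begin{align*}
\text{limits} &= \bar{d} \pm 1.96\, s_d \\
\text{UI:}\quad \bar{d} &= \frac{-0.29 + 0.71}{2} = 0.21,
  & s_d &= \frac{0.71 - (-0.29)}{2 \times 1.96} \approx 0.26 \\
\text{POP:}\quad \bar{d} &= \frac{-0.57 + 0.79}{2} = 0.11,
  & s_d &= \frac{0.79 - (-0.57)}{2 \times 1.96} \approx 0.35
\end{align*}
```

The midpoints reproduce the reported differences of 0.21 and 0.11, so the interval and histogram results are mutually consistent under this assumption.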

Table 3 Inter-rater agreement using weighted Kappa statistics for the improvement on the PGI-I and ICIQ score (scores converted to comparable −1 to 1 scales)
Fig. 3 Difference against average and histogram of difference of PGI-I and ICIQ score (Bland–Altman plot; scores converted to comparable −1 to 1 scales)

Discussion

Main findings

In general, women who undergo surgery for UI and POP report a substantial improvement in their disease-related symptoms, in addition to improved quality of everyday life [12, 23, 24]. In this study we also found a high degree of improvement in patient satisfaction following UI and POP surgery, using both the ICIQ score and the PGI-I score. Nevertheless, we found that the PGI-I score was higher than the ICIQ score; graphically, we observed poor agreement with the line of equality, and kappa statistics showed only fair inter-rater agreement.

Strengths and limitations

The DugaBase is a national clinical database containing 92.2 % of all Danish POP and UI surgeries carried out in 12 private clinics and 23 public hospitals, all reporting in the same web-based data-entry system [15, 16]. With a response rate of 60.4 % for completing both the pre- and the post-questionnaire, we considered our body of data valid for the purpose of this study.

When comparing a new measurement with an established scoring system, it is necessary to test whether they agree adequately. Without a gold standard, analytical correlation models would have been misleading because the scores would have been measuring the same clinical intervention. Instead, we used a graphical approach to describe the relation between the established and the alternative measure [19, 20, 25]. The advantage of the 95 % limits of agreement method is that it does not reduce the question to a yes-or-no answer on whether the scores are equivalent. Instead, the method gives the interval within which the two scales agree, together with a mean difference, and the researcher has to interpret whether the results are meaningful in a clinical context. We did not find that differences between the converted scores of 0.11 and 0.21 (corresponding to a 5 and 10 % relative difference) were acceptable from a clinical point of view. Even more critically, the 95 % limits of agreement, ranging from −0.57 to 0.79 for POP, showed that the interval of agreement between the scores spanned most of the scale.
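One reading of the quoted percentages, assuming that "relative" means relative to the full width of the −1 to 1 scale (the text does not state the denominator), is:

```latex
\[
\frac{0.11}{1 - (-1)} = 0.055 \ (5.5\,\%,\ \text{POP}),
\qquad
\frac{0.21}{1 - (-1)} = 0.105 \ (10.5\,\%,\ \text{UI})
\]
```

Under this assumption the values are close to the 5 and 10 % given in the text.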

Interpretation

There may be several reasons for these findings. First, it may be a matter of simple recall bias, as the women had to compare their current state indirectly with their symptoms and inconvenience before surgery when reporting the PGI-I measure. Second, it seems that the women’s answers reflect their current status and not a change in improvement; thus, the PGI-I shows better inter-rater agreement with their post-surgery ICIQ scores than with the scores of improvement (results not shown). Therefore, are they actually answering our questions? Apparently, they answer whether they were satisfied with the operation in general and do not make a comparison with their original symptoms and inconvenience.

Third, we cannot rule out that the differences may, at least partly, be due to different phrasing of the questions (Fig. 1). However, we find it unlikely that this is the entire explanation, because results for separate comparisons between PGI-I on the one hand and the separate Likert and VAS scales on the other (data not shown) were very similar to the overall results, which corroborates that the findings were more likely related to measurement issues. In this study we focused only on PROMs and did not include objective clinical measures. Although these measures were available, they do not necessarily correlate with PROMs [4, 26, 27], and objective clinical measures could therefore not have served as a gold standard for comparison.

When designing and choosing PROMs of clinical interventions a number of criteria have to be fulfilled: reliability, responsiveness, interpretability, and response burden [18, 28]. Regarding the responsiveness of PROMs used in questionnaires, their applicability for evaluating changes is relevant. Only the ICIQ score showed an acceptable ceiling effect under 15 %; hence, the ability of the PGI-I score to detect improvement in clinical quality over time will be limited, although a deterioration in quality could be detected.

Conclusion

Questionnaires including questions based on the validated ICIQ have been developed by the ICI [1, 29], and two studies suggest using a PGI-I score as a supplemental or surrogate measure for women undergoing surgery for UI and POP [10, 13]. The PGI-I score has been widely accepted [24, 30] and used in a study of incontinence disorders in men [14], as well as in other incontinence disorders [31]. The question is, how reliable is a global measure of improvement for measuring the clinical quality of an intervention? Because only a post-surgery questionnaire is needed, the PGI-I reduces the response burden [18]. It is therefore tempting to use the PGI-I as a surrogate for more complicated pre- and post-questionnaires, but this study demonstrates that the PGI-I has to be used carefully, or in addition to other PROMs, to evaluate improvement following UI or POP surgery, because this score does not take recall bias into consideration.