Introduction

Pelvic floor dysfunction (PFD) is a broad term used to describe specific clinical conditions of the pelvic floor (PF), such as urinary incontinence (UI), pelvic organ prolapse (POP), sexual dysfunction, and functional defecation disorders [1]. The prevalence of PFD varies widely in the literature [2, 3]; however, it is known to be more frequent in females [1] and can negatively affect the woman’s quality of life, causing bother in the PF, depression, and a sedentary lifestyle [4,5,6].

For the assessment and quantification of the intensity and severity of symptoms, patient-reported outcome measures (PROMs) are often used [19]. Among the existing PROMs to assess the symptoms of PFD in women, the Pelvic Floor Distress Inventory (PFDI) is frequently used [7, 8] and recommended as grade A by the International Consultation on Incontinence (ICI) [9]. The ICI classifies a PROM as grade A if the Committee found “published data indicating that the questionnaire is valid, reliable, and responsive to change following standard psychometric testing,” without taking into account the details of how the study was conducted and the values for the measurement properties [9]. In addition, the 46 items of the PFDI [10] were abbreviated in a short version with 20 items, the PFDI-20 [6]. The purpose of both forms is to assess the distress of the symptoms of PFD through items divided into three subscales—Pelvic Organ Prolapse Distress Inventory (POPDI), Colorectal-Anal Distress Inventory (CRADI), and Urinary Distress Inventory (UDI)—which assess POP, anorectal, and urinary symptoms, respectively [6, 10].

The PFDI and PFDI-20 have been translated into and validated in several languages [11,12,13,14,15]. Both versions have also been used in a large number of studies [7, 8] and in clinical practice to identify the symptoms of PFD and the intensity of distress caused by these symptoms. However, it is necessary to evaluate the measurement properties of the two versions to determine whether they are suitable for the clinical and scientific context. Furthermore, the classification of a PROM proposed by the ICI is simplistic and does not analyze the study methods or values of the measurement properties. To our knowledge, no systematic reviews of the measurement properties of the PFDI or PFDI-20 have been published. Thus, the aim of the current study was to investigate whether the measurement properties of the PFDI and PFDI-20 reported in previous studies meet the criteria for good measurement properties according to the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN).

Materials and methods

This systematic review (PROSPERO ID CRD42020157083) followed the guidelines of the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) [16]. The search for studies was carried out by two independent reviewers in the PubMed, SCOPUS, Web of Science, ScienceDirect, and CINAHL databases. We also conducted a search in Google Scholar to identify studies not indexed in databases. Instrument names combined with their acronyms were used in a search filter for measurement properties adapted for each database (Appendix A) and recommended by COSMIN [17]. The search was carried out in August 2020, and articles were evaluated without restrictions on language or publication date. The reference lists of all included studies were checked to identify studies not found in the database searches.

In this systematic review, we followed the COSMIN methodology, which is based on the following definitions of measurement properties: content validity (degree to which the content of a PROM is an adequate reflection of the construct to be measured); structural validity (degree to which the scores of a PROM are an adequate reflection of the dimensionality of the construct to be measured); criterion validity (degree to which a PROM score is an adequate reflection of a “gold standard”); cross-cultural validity (degree to which performance of items in a translated or culturally adapted PROM is an adequate reflection of the performance of items in the original PROM version); hypothesis testing for construct validity (degree to which the scores of a PROM are consistent with hypotheses based on the assumption that the PROM validly measures the construct to be measured), divided into convergent validity (scores should correlate with other measures of related constructs), discriminative validity or comparison between known groups (comparison between groups that differ in the construct), and divergent validity (scores should not correlate with measures of unrelated constructs); reliability (degree to which a measure is free from measurement errors: test-retest, intra- and inter-evaluator); internal consistency (degree of interrelation between items); measurement errors (systematic and random error of a patient’s score that is not attributed to real changes in the construct to be measured); and responsiveness (ability of a PROM to detect changes over time in the construct to be measured) [18].

Original studies that reported one or more measurement properties of the PFDI and PFDI-20 and their subscales were included. Studies that were available only as event summaries or proceedings, or that assessed the reliability of PFDI and PFDI-20 paper versions versus telephone/computer administration, were excluded. According to the COSMIN guideline, these types of reliability studies should be ignored as they do not provide enough information about this PROM measurement property [19]. Titles, abstracts, and full texts were reviewed by two independent researchers (G.T.A., T.S.H.). In case of disagreements between the researchers, a third (J.F.V.) was consulted.

The extraction of data on the measurement properties and the evaluation of the methodological quality of the studies were carried out by two independent researchers (G.T.A., T.S.H.). The type of measurement property, results of the measurement property, objective of the study, sample size, type of PFD included, country where the study was carried out, and version of the PFDI used were extracted. For analysis, recording, and storage of results, the reference manager EndNote Web and a spreadsheet in the Microsoft Excel program were used.

Data analysis was performed in three stages. In the first step, the methodological quality of the included studies was rated using the 4-point scoring system on the COSMIN checklist [19]. The methodological aspects of the studies and the statistical methods used for each measurement property were rated as “very good,” “adequate,” “doubtful,” or “inadequate.” To rate the methodological quality of the studies according to sample size, the following criteria were used as a standard: the study was rated as “very good” if the sample size was seven times the number of items and ≥ 100 individuals; “adequate” if the sample size was five times the number of items and ≥ 100 individuals, or six times the number of items, but < 100 individuals; “doubtful” if the sample size was equal to five times the number of items and < 100 individuals; and “inadequate” if the sample size was less than five times the number of items. At the end of this stage, the methodological quality of each measurement property was summarized by study. For content validity, separate aspects were evaluated regarding relevance, comprehensiveness, and comprehensibility in the three stages [19].
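The sample-size criteria above can be expressed as a short decision rule. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that "seven times," "six times," and "five times" are read as "at least" that multiple of the number of items.

```python
def rate_sample_size(n_participants: int, n_items: int) -> str:
    """Rate methodological quality by sample size (hypothetical helper).

    Assumption: each multiple in the text is treated as a lower bound.
    """
    if n_items <= 0 or n_participants < 0:
        raise ValueError("n_items must be positive and n_participants non-negative")
    # "Very good": at least 7 participants per item and at least 100 individuals.
    if n_participants >= 7 * n_items and n_participants >= 100:
        return "very good"
    # "Adequate": at least 5 per item with >= 100 individuals,
    # or at least 6 per item with fewer than 100 individuals.
    if n_participants >= 5 * n_items and n_participants >= 100:
        return "adequate"
    if n_participants >= 6 * n_items and n_participants < 100:
        return "adequate"
    # "Doubtful": at least 5 per item but fewer than 100 individuals.
    if n_participants >= 5 * n_items:
        return "doubtful"
    # "Inadequate": fewer than 5 participants per item.
    return "inadequate"


# Illustrative values only (20 items for the full PFDI-20, 6 items for a subscale).
print(rate_sample_size(150, 20))
print(rate_sample_size(30, 6))
```

Under this reading, the branches for fewer than 100 individuals can only apply to short scales such as the 6- or 8-item subscales, because five times 20 items already equals 100.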

In the second step, each measurement property in each study was rated as “sufficient,” “insufficient,” or “indeterminate,” according to criteria for good measurement properties of the COSMIN guideline for systematic reviews of PROMs [19]. These ratings were summarized qualitatively to determine the overall rating of the measurement property for the PFDI and PFDI-20. If all studies indicated a “sufficient,” “insufficient,” or “indeterminate” rating for a specific measurement property, the overall rating of that measurement property was rated accordingly. If there were inconsistencies between studies on differences in methodological quality, populations, etc., explanations were provided. The explanations were discussed until a consensus was reached on the overall rating of the measurement property. If no explanation was found, the overall rating was considered “inconsistent.”

In the third step, the overall rating of evidence by measurement property was complemented by a level of quality of evidence, using the modified Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach of the COSMIN methodology [19]. This approach takes into account the quality of the study, the indirectness of the evidence, the inconsistency of the results, and the imprecision of the evidence (number of studies and sample size). The overall quality of the evidence was rated as “high,” “moderate,” “low,” or “very low.” This classification considers “high” quality of evidence when there is a lot of confidence that “the true measurement property lies close to that of the estimate of the measurement property,” “moderate” quality when “the true measurement property is likely to be close to the estimate of the measurement property, but there is a possibility that it is substantially different,” “low” quality when “the true measurement property may be substantially different from the estimate of the measurement property,” and “very low” quality of evidence when “the true measurement property is likely to be substantially different from the estimate of the measurement property” [19]. The measurement properties that were rated as “indeterminate” in the previous step did not receive a rating in this third step, as there was no evidence to rate.
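As a rough illustration of how the four GRADE factors combine, the sketch below starts at "high" and moves down one level per downgrade. It is a simplification under stated assumptions: the number of levels assigned to each factor is supplied by the reviewer, the function name is hypothetical, and the text above does not specify how many levels each factor may remove.

```python
LEVELS = ["high", "moderate", "low", "very low"]


def grade_quality(risk_of_bias: int = 0, inconsistency: int = 0,
                  imprecision: int = 0, indirectness: int = 0) -> str:
    """Return the quality of evidence after downgrading (hypothetical helper).

    Each argument is the number of levels the reviewers chose to downgrade
    for that factor (assumption: non-negative integers).
    """
    downgrades = [risk_of_bias, inconsistency, imprecision, indirectness]
    if any(d < 0 for d in downgrades):
        raise ValueError("downgrade levels must be non-negative")
    # The rating cannot fall below "very low".
    index = min(sum(downgrades), len(LEVELS) - 1)
    return LEVELS[index]


# One downgrade for risk of bias only.
print(grade_quality(risk_of_bias=1))
```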

Results

Through the database searches, 2857 studies were found. After removing duplicates (n = 793) and reading titles and abstracts, 46 studies were read in full. Of these, 15 were excluded for not evaluating measurement properties (n = 11), for analyzing only a subscale of the PFDI-20 (n = 2), and for evaluating reliability through paper versus internet or telephone (n = 2). Of the 31 studies selected after applying the eligibility criteria, 1 study was added through the reference lists, with 32 studies included at the end (Fig. 1).

Fig. 1 Flowchart for selecting studies based on PRISMA
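The counts reported for the selection process can be checked with simple arithmetic. The sketch below uses only the numbers given in the text. The number of records screened after duplicate removal is derived here and is not stated in the source, and the variable names are illustrative.

```python
records_found = 2857
duplicates = 793
full_text_read = 46
excluded_full_text = {
    "did not evaluate measurement properties": 11,
    "analyzed only a subscale of the PFDI-20": 2,
    "paper versus internet or telephone reliability": 2,
}
added_from_reference_lists = 1

# Derived value (not reported in the text): records left after removing duplicates.
screened = records_found - duplicates
# Studies remaining after the eligibility criteria were applied to full texts.
eligible = full_text_read - sum(excluded_full_text.values())
# Final number of included studies.
included = eligible + added_from_reference_lists

print(screened, eligible, included)
```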

Of the 32 studies included, 7 evaluated the measurement properties of the PFDI and 25 evaluated the PFDI-20, with a total of 8217 participating women. The largest share of studies was conducted in the USA (n = 9; 28.13%), and half included women with PFD (n = 16; 50%). The characteristics of the included studies are shown in Chart 1. All 32 included studies reported at least one measurement property; cross-cultural validity was the only property that no study reported.

Chart 1 Characteristics of the included studies (n = 32)

Content validity

Content validity was assessed by 21 studies, of which 2 were on the PFDI [10, 20] and 19 on the PFDI-20 [6, 11,12,13,14,15, 21,22,23,24,25,26,27,28,29,30,31,32,33] (Appendix B). Content validity could not be rated in one study on the PFDI [10] and three studies on the PFDI-20 [6, 30, 32] because it was not clear which aspects (relevance, comprehensiveness, or comprehensibility) were assessed.

The overall methodological quality of all the remaining studies [11,12,13,14,15, 20,21,22,23,24,25,26,27,28,29, 31, 33] was rated as “inadequate” because how the patients were asked about relevance, comprehensiveness, and comprehensibility and how the professionals assessed relevance and comprehensiveness were not clear or not presented in enough detail. Because of this, the content validity of these studies was rated as “insufficient.”

Structural validity

No studies evaluated the structural validity of the PFDI, and only one study reported this measurement property for the PFDI-20 [15]. The study by Ma et al. [15] used confirmatory factor analysis (CFA) and found five factors that explain 69.55% of the cumulative variance: anal and colorectal distress, direct feelings of POP and symptoms of irritation or obstruction of the lower urinary tract, several types of UI, external force to defecate, and symptoms of a rectocele.

The methodological quality of the study by Ma et al. [15] was assessed as “adequate” because of the small sample size (less than seven times the number of items on the PFDI-20). The structural validity of the PFDI-20 was rated as “indeterminate” because the unidimensionality of each subscale was not confirmed: the analysis found neither three factors (one for each subscale) nor a single factor that assesses the distress of the symptoms of PFD.

Internal consistency

Internal consistency was reported by 25 studies, of which 5 were on the PFDI with “very good” [20, 36] and “inadequate” [10, 34, 35] methodological quality and 20 on the PFDI-20 with “very good” [11, 12, 15, 23,24,25, 28, 32, 37, 38], “doubtful” [26, 30, 31, 33], and “inadequate” [13, 14, 21, 22, 25, 29] methodological quality (Table 1). The “doubtful” rating was due to a small sample size (equal to five times the number of items and fewer than 100 individuals). The “inadequate” rating was due to a very small sample size (less than five times the number of items) and because the studies did not report the Cronbach’s α value for each subscale.

Table 1 Cronbach’s α values for internal consistency of PFDI and PFDI-20 and subscales

The internal consistency of all studies on the PFDI and PFDI-20 was rated as “indeterminate” because the factorial structure is unclear, which contradicts the unidimensionality of each of the three subscales stipulated by the instrument’s developers. Only one study [15] reported the structural validity of the PFDI-20, and it found 5 factors through the CFA: anal and colorectal distress (factor 1); direct POP feelings and symptoms of irritation or obstruction of the lower urinary tract (factor 2); various types of UI (factor 3); external force to defecate (factor 4); symptoms of a rectocele (factor 5). Thus, according to COSMIN, the methodological quality of the studies that analyzed internal consistency must be rated as “doubtful,” and no conclusions can be drawn.
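For readers unfamiliar with the statistic, Cronbach's α for one subscale can be computed from item-level responses with the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of the total score). The sketch below is a minimal illustration with made-up data; the function name and scores are hypothetical, and the result is only interpretable when the subscale is unidimensional.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for one subscale (hypothetical helper).

    scores: list of respondents, each a list of item scores of equal length.
    Uses sample variances (n - 1 in the denominator).
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of the summed subscale score.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Made-up responses from three respondents to a two-item scale.
example = [[1, 2], [2, 1], [3, 3]]
print(round(cronbach_alpha(example), 2))
```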

Test-retest reliability

Test-retest reliability was reported by 4 studies on the PFDI, with “doubtful” [34] and “inadequate” [10, 20, 36] methodological quality, and by 17 studies on the PFDI-20, with “doubtful” [13,14,15, 22, 25,26,27,28,29, 32, 38] and “inadequate” [6, 11, 12, 24, 30, 31] methodological quality (Table 2). The “doubtful” quality occurred because of the small sample size, because the time interval between the test and the retest was doubtful, because it was not clear whether the patients were stable between measurements, because the intraclass correlation coefficient (ICC) was not calculated, or because the conditions of the test (environment or administration) differed from those of the retest. The “inadequate” quality occurred because the time interval between the test and retest was considered inappropriate.

Table 2 Test-retest reliability of PFDI and PFDI-20

Test-retest reliability of the PFDI was rated in all studies as “sufficient.” This measurement property was rated as “sufficient” in the majority of the studies on the PFDI-20, and only one study [38] was rated as “indeterminate” for not having used the ICC in the analysis.
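To make the ICC-based rating concrete, the sketch below computes a two-way random-effects, absolute-agreement, single-measurement ICC from test and retest scores. This is an assumption made for illustration: the included studies may have used other ICC models, and the function names and data are hypothetical. The 0.70 cut-off mirrors the threshold used in this review for a "sufficient" rating.

```python
def icc_2_1(scores):
    """ICC(2,1) from a subjects-by-occasions table (hypothetical helper).

    scores: list of subjects, each a list of k measurements (e.g. test, retest).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Sums of squares for subjects, occasions, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def rate_reliability(icc=None):
    """Rate test-retest reliability; None means no ICC was reported."""
    if icc is None:
        return "indeterminate"
    return "sufficient" if icc >= 0.70 else "insufficient"


# Made-up test and retest scores for three participants.
example = [[1, 2], [2, 3], [3, 4]]
print(rate_reliability(icc_2_1(example)))
```

Because this model measures absolute agreement, a constant shift between test and retest lowers the coefficient even when the ranking of participants is unchanged.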

Measurement errors

Five studies, all on the PFDI-20, reported measurement errors, with “doubtful” [14, 15] and “inadequate” [11, 12, 30] methodological quality (Appendix C), because the time interval between measurements was longer or shorter than 2 weeks and the sample size was equal to or less than five times the number of items on the PFDI-20. All the studies used appropriate methods to assess measurement errors, such as standard error of measurement (SEM), limits of agreement (LoA), and smallest detectable change (SDC).

Regarding the rating by study, measurement error was rated as “sufficient” in one study [14], “indeterminate” in two studies [11, 12], and “insufficient” in two studies [15, 30]. The two “indeterminate” ratings were given because the studies did not present the values of minimal important change (MIC), and the two “insufficient” ratings because the studies presented an MIC less than the LoA or SDC.
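The relationship between SEM, SDC, and MIC can be illustrated with commonly used formulas: SEM = SD × √(1 − ICC) and SDC = 1.96 × √2 × SEM. The sketch below is a minimal illustration. The included studies may have derived SEM differently, the function names are hypothetical, and all numeric values are made up rather than taken from any PFDI-20 study.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from the score SD and the ICC."""
    return sd * math.sqrt(1 - icc)


def sdc_individual(sem: float) -> float:
    """Smallest detectable change at the individual level (95% confidence)."""
    return 1.96 * math.sqrt(2) * sem


def rate_measurement_error(sdc: float, mic=None) -> str:
    """Rate measurement error by comparing SDC with MIC (hypothetical helper).

    No MIC reported -> indeterminate; SDC below MIC -> sufficient;
    otherwise insufficient.
    """
    if mic is None:
        return "indeterminate"
    return "sufficient" if sdc < mic else "insufficient"


# Made-up values: SD of 20 points and ICC of 0.84.
sem = sem_from_icc(20.0, 0.84)
sdc = sdc_individual(sem)
print(round(sem, 2), round(sdc, 2), rate_measurement_error(sdc, mic=45.0))
```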

Criterion validity

As there is no gold standard for assessing the distress of all PFD symptoms assessed by the PFDI-20 (POP, anorectal, and urinary), only two studies [6, 39] reported criterion validity for the PFDI-20, both with “very good” methodological quality and “sufficient” criterion validity. According to COSMIN [19], the long version of an abbreviated PROM is considered the gold standard when no other gold standard is available. Pearson’s correlation coefficient was used to assess the criterion validity of the PFDI-20 subscales in the studies by Barber et al. [6] (POPDI-6 r = 0.92; CRADI-8 r = 0.93; UDI-6 r = 0.86) and Barber et al. [39] (POPDI-6 r = 0.90; CRADI-8 r = 0.93; UDI-6 r = 0.88).

Hypothesis testing (convergent validity)

Convergent validity was reported by 15 studies, including 3 studies on the PFDI with “very good” [20] and “inadequate” [10, 36] methodological quality and 12 studies on PFDI-20 with “very good” [14, 23, 32, 37], “doubtful” [24, 27, 28, 31], and “inadequate” methodological quality [13, 15, 22, 29] (Appendix D). The “doubtful” methodological quality rating was given because of the low sample size (equal to 5 times the number of items). The studies were evaluated with “inadequate” methodological quality due to the very small sample size (less than 5 times the number of items).

All studies used correlations as the main statistical method. Instruments to assess quality of life, whether or not related to PFD, such as the SF-12 Health Survey (SF-12), the Short Form Health Survey (SF-36), the Visual Analogue Scale (VAS), the Pelvic Floor Impact Questionnaire (PFIQ-7), and the Urinary Incontinence Quality of Life Scale (I-QoL), were predominantly used [13,14,15, 20, 22, 24, 27,28,29, 32, 36]. The majority of studies quantified the degree of POP objectively through the Pelvic Organ Prolapse Quantification (POP-Q) [10, 20, 23, 27, 29, 31, 36, 37].

The convergent validity rating for the PFDI was “sufficient” [20], “indeterminate” [10], and “insufficient” [36]. The “indeterminate” rating was given because there was no clearly defined hypothesis, and the “insufficient” rating because the results were not in line with the hypotheses. For the PFDI-20, the convergent validity rating for each study was “sufficient” [14, 22,23,24, 28, 29, 31, 32] or “insufficient” [13, 15, 27, 37]. The rating was “insufficient” when < 75% of the results were in accordance with the hypotheses.

Hypothesis testing (divergent validity)

Divergent validity was reported by six studies, all on the PFDI-20, with “very good” [32], “doubtful” [28, 31], and “inadequate” [13, 22, 29] methodological quality (Appendix E). Studies were rated as “doubtful” and “inadequate” quality because of the low (equal to 5 times the number of items) and very low (less than 5 times the number of items) sample sizes, respectively.

Regarding the rating of divergent validity, 5 studies were rated as “sufficient” [22, 28, 29, 31, 32] and one as “insufficient” [13]. The study by Mattsson et al. [13] was rated as “insufficient” divergent validity as < 75% of the results were in accordance with the hypotheses.

Hypothesis testing (comparison between known groups)

The comparison between known groups, or discriminative validity, was reported by one study on the PFDI with “very good” [20] methodological quality and by eight studies on the PFDI-20 with “very good” [23], “doubtful” [11, 12, 15, 30, 31, 37], and “inadequate” [29] methodological quality (Appendix F). The studies were rated as “doubtful” quality because of the small sample size, because they used incorrect statistical tests, or because they did not present values for each subscale of the PFDI-20. The study with “inadequate” methodological quality received this rating because the sample size was considered too small and because an inadequate statistical test was used. All studies on the PFDI and PFDI-20 included women with and without POP or with different stages of POP.

The rating of the comparison of known groups of the PFDI was “sufficient” and of the PFDI-20 was “sufficient” [11, 12, 15, 23, 29, 31] and “insufficient” [30, 37]. The “insufficient” rating was due to < 75% of the results being in accordance with the hypotheses.
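The 75% criterion applied in the three hypothesis-testing subsections above can be written as a simple rule. The sketch below is illustrative: the function name is hypothetical, and it treats a study without predefined hypotheses as "indeterminate," as described for convergent validity.

```python
def rate_hypothesis_testing(results):
    """Rate construct validity for one study (hypothetical helper).

    results: list of booleans, one per predefined hypothesis
             (True = result in accordance with the hypothesis).
    An empty list means no hypothesis was defined.
    """
    if not results:
        return "indeterminate"
    share_confirmed = sum(results) / len(results)
    # At least 75% of results must agree with the hypotheses.
    return "sufficient" if share_confirmed >= 0.75 else "insufficient"


# Made-up example: three of four hypotheses confirmed.
print(rate_hypothesis_testing([True, True, True, False]))
```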

Responsiveness

Seventeen studies evaluated responsiveness (Appendix G): 2 on the PFDI with “very good” [41] and “inadequate” [40] methodological quality and 15 on the PFDI-20 with “very good” [6, 26], “doubtful” [11, 12, 14, 15, 23, 27, 30, 37, 39, 42, 44], and “inadequate” [21, 43] methodological quality. The “doubtful” quality of the studies was due to the small sample size, the use of an inappropriate method to test the hypotheses, and the lack of an adequate description of the intervention. The studies were rated as “inadequate” because of the very small sample size or the use of inappropriate statistical methods. Surgery was the intervention most commonly used in the studies [6, 11, 12, 14, 15, 21, 23, 26, 27, 30, 37, 39,40,41,42, 44].

The PFDI responsiveness rating was “sufficient” [40] and “insufficient” [41]. Seven studies on the PFDI-20 received “sufficient” ratings of responsiveness [6, 14, 23, 26, 27, 37, 44], six had an “indeterminate” rating [11, 12, 15, 21, 30, 42], and two were rated as “insufficient” [39, 43]. Responsiveness was rated as “indeterminate” because the studies did not present values for the subscales or presented only values of statistical significance and “insufficient” because the results were not in line with the hypotheses.

Data synthesis

Table 3 presents the overall rating of the measurement properties and the quality of the evidence of the PFDI and PFDI-20. The general content validity rating of the PFDI and PFDI-20 was “insufficient” and the quality of the evidence “very low” due to the existence of only studies with inadequate methodological quality.

Table 3 Classification and quality of evidence of the measurement properties of PFDI and PFDI-20

The structural validity of the PFDI-20 and the internal consistency of the PFDI and PFDI-20 were rated as “indeterminate” because the PROM subscales do not have proven one-dimensionality. Thus, the quality of the evidence was not assessed.

The test-retest reliability of the PFDI was rated as “sufficient” because all studies were assessed as “sufficient” and with an ICC ≥ 0.7. However, the quality of the evidence of the PFDI was considered “moderate” due to the fact that all studies were evaluated as having “doubtful” or “inadequate” methodological quality. The test-retest reliability of the PFDI-20 was also rated as “sufficient” with “moderate” quality of evidence due to the existence of several studies of “doubtful” or “inadequate” quality.

For the PFDI-20, measurement errors were rated as “inconsistent” because the majority of the existing studies were rated as “indeterminate” or “insufficient.” The quality of the evidence was assessed as “moderate” because all studies were of “doubtful” or “inadequate” quality and because the majority of studies had an “indeterminate” or “insufficient” rating.

The criterion validity and the hypothesis testing for construct validity of the PFDI-20 were rated as “sufficient” with high quality of evidence. However, the hypothesis testing for construct validity of the PFDI was rated as “insufficient,” with the quality of the evidence assessed as high. The “insufficient” rating occurred because 50% of the studies were rated as “indeterminate” or “insufficient” and 50% as “sufficient.” This follows the criterion that if < 75% of the results are rated as “sufficient,” the overall rating of the measurement property is “insufficient.”

The responsiveness of the PFDI was rated as “insufficient” because 50% of the studies were rated as “sufficient” and 50% as “insufficient.” This is also due to the small number of studies that evaluated this measurement property for the PFDI. The quality of the evidence of responsiveness of the PFDI was assessed as moderate based on the disagreement between the ratings of the studies, as only two studies are available, one rated as “sufficient” and the other as “insufficient.” Regarding the responsiveness of the PFDI-20, this measurement property was rated as “inconsistent” and with a high quality of evidence. The rating was “inconsistent” because of the considerable number of studies with “indeterminate” and “sufficient” ratings.

Discussion

For the PFDI, only the hypothesis testing presented a high quality of evidence, while the PFDI-20 had a high quality of evidence for criterion validity, hypothesis testing, and responsiveness. For the structural validity and internal consistency of both PROMs, it was not possible to determine the quality of the evidence. The content validity of both PROMs had very low quality of evidence. No studies on the PFDI and PFDI-20 assessed cross-cultural validity. This demonstrates a serious problem in the validation process of these PROMs, reflecting the need for a new process to evaluate the measurement properties of the instruments. It is possible that a process of validation and evaluation of the measurement properties of the PROM, mainly evaluating relevance, comprehensiveness, and comprehensibility of the content validity and following the COSMIN methodology, would be most appropriate.

Content validity is considered the most important measurement property of a PROM because, if the PROM construct is unclear, the evidence for the remaining measurement properties is not valuable [18, 19]. This measurement property is evaluated by asking patients about relevance, comprehensiveness, and comprehensibility, and professionals about relevance and comprehensiveness [18]. Thus, the content validity of the PFDI and PFDI-20 would be an adequate reflection of the construct to be measured if relevance, comprehensiveness, and comprehensibility were evaluated. In addition, it could be assessed whether the items of both PROMs are relevant for the assessment of the distress of the symptoms of PFD, comprehensive for the assessment of this distress, and understandable to the population. In this systematic review, 1 study on the PFDI and 16 studies on the PFDI-20 almost exclusively assessed comprehensibility by patients; only 1 study assessed the comprehensiveness of the PFDI-20 and 3 studies assessed the relevance of the PFDI-20 to patients. Regarding the evaluation by professionals, 16 studies assessed the relevance and comprehensiveness of the PFDI-20. Overall, the studies are not clear and do not present details on how patients and professionals were questioned on these aspects of content validity. This represents a major flaw in the validation process for both instruments and is reflected in the quality of the remaining measurement properties.

In the current systematic review, structural validity was not assessed by any study on the PFDI and was assessed by only one study on the PFDI-20. For the PFDI-20, the quality of evidence of internal consistency was not assessed because the factor analysis found neither three factors (one for each subscale) nor a single factor assessing the distress from PFD symptoms. In addition, in the study by Ma et al. [15], the method of factor analysis is not very clear, since it is advisable to perform exploratory factor analysis before CFA if there is no evidence about the dimensionality of an instrument [18]. The lack of studies on the structural validity of the PFDI and the failure to assess the quality of evidence on the structural validity of the PFDI-20 compromised the quality of the evidence of internal consistency. This is because structural validity is a prerequisite for assessing internal consistency: Cronbach’s alpha must be calculated for each unidimensional factor identified in the instrument [18].

The cross-cultural validity or invariance of the measure was the only measurement property not evaluated for the PFDI and PFDI-20. For the assessment of cross-cultural validity, data from two different populations are needed, in which one population completes the original version of the instrument and the other completes the culturally adapted version. For the analysis between the versions of the instrument, some statistical methods are generally used, such as the differential item functioning (DIF), which detects items that differ between subgroups of a population; factor analysis; logistic regression analysis; and techniques of item response theory [18, 19].

In addition to cross-cultural validity, criterion validity was not evaluated for the PFDI. However, two studies reported this measurement property for the PFDI-20 [6, 39]. According to the COSMIN guideline [19], if a measure does not have a gold standard defined in the literature, one way to check the criterion validity is by comparing the long version of the PROM with the short version. In the case of the PFDI and PFDI-20, which assess different symptoms of PFD and their distress in women, there is no established gold standard. Thus, the comparison between the PFDI and PFDI-20 is useful for assessing criterion validity, which was rated as “sufficient” with high quality of evidence for the PFDI-20.

Test-retest reliability is analyzed based on two measurements made with the same subject after a period of time. According to COSMIN [18, 19], some requirements are necessary for the “sufficient” rating of reliability, such as an appropriate time interval between the test and the retest, patient stability in the interim period, similar conditions for the measurements, an appropriate sample size, and analysis of the ICC with its specifications for continuous results. The reliability of the PFDI and PFDI-20 was rated as “sufficient” because of the use of the ICC and values above 0.70 in more than 70% of the studies, but the evidence was considered moderate because several studies had “doubtful” or “inadequate” quality. This rating reflects a single downgrade, for risk of bias, under the modified GRADE approach. Thus, although some studies showed flaws in relation to the requirements for reliability analysis, the overall quality of the evidence for the PFDI and PFDI-20 was defined mainly by the ICC results.

Similar to reliability, measurement errors are assessed for the same subject based on two measurements taken at different time points. For the evaluation, each study must clearly provide the calculation and value of the MIC, SEM, SDC, or LoA. The MIC is defined by COSMIN as “the smallest change in the score in the construct to be measured that patients perceive as important” and can come from different studies [18]. In addition, for a PROM to have a “sufficient” overall rating and high quality of evidence of measurement errors, it should be assessed, respectively, whether the SDC or LoA is less than the MIC and whether the risk of bias, inconsistency, imprecision, and indirectness assessed by the modified GRADE are satisfactory [18]. In the case of the PFDI-20, which had an “inconsistent” overall rating of measurement errors and moderate quality of evidence, these measurement error assessment requirements were not met. The overall rating was considered “inconsistent” for the PFDI-20 measurement errors because > 75% of the studies had SDC or LoA values greater than the MIC value or did not define the MIC.

When a gold standard does not exist or is not used by studies to assess criterion validity, the hypothesis test for construct validity must be performed through relationships expected with other measures (convergent validity), relationships not expected with other measures (divergent validity), and/or expected differences between relevant groups (discriminative or known-groups validity). For this, one of the prerequisites for evaluating the hypothesis testing is the previous formulation of hypotheses, including the directions and expected magnitudes of the correlations between the instruments or of the mean differences between the groups [18]. In this systematic review, the overall rating of the hypothesis testing for construct validity of the PFDI was considered “insufficient” because < 75% of the studies were rated as “sufficient.” For the convergent validity of the PFDI, 66.67% of the studies received the “indeterminate” or “insufficient” rating because no hypothesis was defined or because the result was not in accordance with the hypothesis presented. Divergent validity of the PFDI was not evaluated by any studies, and only one study [20] compared known groups for the PFDI, with a “sufficient” rating. In contrast, the PFDI-20 had a “sufficient” rating for hypothesis testing and high quality of evidence.

Responsiveness is an important measurement property for clinical practice because it reflects the ability of an instrument to detect changes over time in the construct to be measured. Thus, the responsiveness of a PROM helps clinical professionals to observe improvement or worsening in a patient’s health after an intervention [18]. In addition to being evaluated in longitudinal studies, this measurement property is rated for methodological quality using four approaches: (1) criterion approach; (2) construct approach, comparison with other PROMs; (3) construct approach, comparison between subgroups; (4) construct approach, pre- and post-intervention. The evaluation of approaches 1, 2, and 3 is comparable, respectively, to the evaluation of criterion validity, of hypothesis testing for construct validity (convergent validity), and of hypothesis testing for construct validity (known groups). However, the overall responsiveness of the PFDI was rated as “insufficient,” with moderate quality of evidence, because hypotheses were lacking or because the results did not agree with the hypotheses in the two existing studies [40, 41]. For the PFDI-20, responsiveness had an overall “inconsistent” rating and high quality of evidence. The rating was “inconsistent” because most studies on the PFDI-20 received a “doubtful” rating, as they did not provide adequate information about the intervention and/or used an inappropriate statistical method to test the hypotheses.

This was the first systematic review to investigate the measurement properties of two instruments frequently used in scientific research [7, 8] and recommended by the ICI as grade A for the assessment of PFD in women [9]. In clinical practice, PROMs whose measurement properties are supported by high-quality evidence contribute to the correct assessment of the patient’s health status. In scientific research, high-quality PROMs improve the accuracy of data analysis on the health status of a particular group or population. For researchers and health professionals in the field of urogynecology who use the PFDI or PFDI-20, this systematic review presented the quality of evidence of the measurement properties of both instruments in detail and also identified the measurement properties that need further investigation. Despite this, the failure to assess interpretability, defined as “the degree to which qualitative significance can be attributed to quantitative scores or changes in the scores of an instrument,” can be considered a limitation of this study. Interpretability is not considered a measurement property by COSMIN because it does not assess the quality of studies related to validation and reliability [18]. However, the evaluation of interpretability is recommended in order to understand the score in different populations.

According to the results of this systematic review, the PFDI has a high quality of evidence for construct validity (hypothesis testing), moderate quality for test-retest reliability and responsiveness, and very low quality of evidence for content validity. The PFDI-20 demonstrates a high quality of evidence for criterion validity, construct validity (hypothesis testing), and responsiveness, moderate quality for test-retest reliability and measurement errors, and very low quality of evidence for content validity. The internal consistency of the PFDI and PFDI-20 was not evaluated because of the scarcity of studies that confirmed the unidimensionality of the PFDI, the PFDI-20, or their subscales. Thus, we suggest that future studies assess the factorial structure of both PROMs to determine the dimensionality and internal consistency of their dimensions. This would help in identifying items that assess the same construct, such as POP, anorectal, and urinary symptoms, or just distress from PFD symptoms. In addition, cross-cultural validity should be evaluated as a measure of how the translated items behave relative to the original version of the instrument and, above all, content validity should be evaluated throughout its methodological process in order to improve the quality of evidence for this measurement property.