Introduction

The Pelvic Organ Prolapse Quantification (POP-Q) system was introduced in 1996 as an objective and precise method for describing pelvic floor anatomy in women with pelvic floor disorders [1]. Since its introduction, it has been adopted by multiple professional societies, including the American Urogynecologic Society, the Society of Gynecologic Surgeons, and the International Continence Society. It has become a useful clinical and research tool for the longitudinal evaluation of women with prolapse and their treatment outcomes, and it allows accurate communication of pelvic floor anatomy among pelvic floor surgeons.

Several studies in the gynecologic literature have validated the POP-Q system [2, 3]. These studies have demonstrated excellent intra-examiner and inter-examiner reliability and reproducibility. Yet, as originally described, the POP-Q system makes no specific recommendations for variables such as patient position, type of exam table or chair, degree and type of strain, or the method by which quantitative measurements should be made. Subsequent research has shown that such variables influence POP-Q results [4–6].

Despite its precision, reproducibility, and reliability, the POP-Q system is underutilized in clinical practice. An estimated 40% of pelvic floor surgeons use it in clinical practice and 60% use it for research purposes [7]. Reasons given for this low utilization rate include perceptions that the POP-Q is too time-consuming, complicated, confusing, and difficult to learn and, for some, that it has no clinical relevance [7]. Thus, there has been interest in simplifying the POP-Q system to make it more practical and efficient for clinical and research applications [8, 9].

It has been suggested that, to save time, clinicians modify the exam by estimating POP-Q points rather than measuring them. This study was therefore performed with the primary objective of comparing the POP-Q stage and individual points obtained with a standard measured POP-Q examination against those obtained with an estimated technique.

Materials and methods

Between April and November 2008, 50 consecutive women presenting to the Section of Urogynecology and Pelvic Reconstructive Surgery at Cleveland Clinic Florida with a primary complaint of pelvic organ prolapse were recruited to participate in this Institutional Review Board-approved study. The inclusion criterion was a primary complaint of genital prolapse; exclusion criteria were inability to consent to study enrollment or to tolerate two consecutive genital examinations. After consent was obtained, subjects underwent a standard POP-Q examination with a rigid marked measuring stick (POPStix™, Auckland, New Zealand) and a POP-Q by estimation (the “eyeball” POP-Q) in a randomized order by two successive examiners.

Prior to study enrollment, we developed the “eyeball” POP-Q technique, a novel approach that assesses the individual POP-Q points by both visual estimation and palpation. In this technique, the points along the anterior and posterior vaginal walls (Aa, Ba, Ap, and Bp) are visually estimated (not measured) in 0.5-cm increments with maximal Valsalva, as originally described in the standard POP-Q [1]. The perineal measurements (GH and PB) are also visually estimated. Vaginal depth (total vaginal length or TVL) and apical descent (points C and D) are assessed by both visual estimation and palpation with the examiner’s dominant hand.

Four examiners, two attending physicians and two fellows, were involved in this study at one clinical site. All study subjects underwent both exam techniques by two examiners (one attending and one fellow), blinded to each other’s results, during the same clinic visit. The order of the exam techniques was randomized by computer generation for the first examiner. The second examiner performed both examinations, but in the opposite order. The study was designed such that, upon enrollment, half of the subjects underwent an estimated exam followed by a measured exam by the first examiner and then a measured exam followed by an estimated exam by the second examiner. The other half of the subjects underwent a measured exam followed by an estimated exam by the first examiner and an estimated exam followed by a measured one by the second examiner (Fig. 1).

Fig. 1 Schematic of study design and analysis
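The counterbalanced ordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_orders`, the seed, and the use of a shuffled balanced list are hypothetical choices, since the paper states only that the order was computer generated and that half of the subjects started with each technique.

```python
import random

def assign_orders(n_subjects, seed=None):
    """Return one (first_examiner_order, second_examiner_order) pair per subject.

    Hypothetical sketch: half of the subjects start with the estimated exam,
    half with the measured exam, and the second examiner always uses the
    reverse of the first examiner's order.
    """
    rng = random.Random(seed)
    half = n_subjects // 2
    first_orders = ([("estimated", "measured")] * half
                    + [("measured", "estimated")] * (n_subjects - half))
    rng.shuffle(first_orders)  # random allocation of subjects to each sequence
    return [(order, order[::-1]) for order in first_orders]

schedule = assign_orders(50, seed=1)
for subject_id, (first, second) in enumerate(schedule[:3], start=1):
    print(subject_id, "examiner 1:", first, "examiner 2:", second)
```

Shuffling a pre-balanced list guarantees the 25/25 split described in the text, which independent coin flips per subject would not.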

Other than estimation and measurement, exams were performed in a similar, standardized manner. This included performing them immediately following voiding, in the supine lithotomy position, and with maximal strain using Valsalva effort. For Valsalva, patients were asked to take a deep breath in, hold their breath, and bear down as if they were constipated and trying to have a bowel movement. Patients confirmed that their maximal prolapse was reproduced with each examination by either palpation or visual confirmation with a hand-held mirror. The bottom half of a bivalve speculum was used to assess all internal values (TVL, C, D, Aa, Ba, Ap, and Bp) for the measured and estimated examinations, with the exception of TVL, C, and D for the estimated technique where the examiner’s dominant hand was used instead of a speculum.

Data were entered into an Excel spreadsheet following completion of data collection and then imported into the Statistical Package for the Social Sciences (SPSS Inc., Chicago, IL, USA). Because this was considered a pilot study, an a priori power calculation was not performed. A post hoc power analysis based on the final sample size showed that, assuming a 2-cm difference between measured and estimated exams to be clinically significant, a sample size of 50 provides 80% power (two-sided α = 0.05). A 2-cm difference has been cited as a clinically important change within the POP-Q literature [4].
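To make the power statement concrete, the following sketch approximates the power of a two-sided paired comparison with a normal approximation. It is not the authors' calculation: the paper does not report the standard deviation of the paired differences, so the `sd_diff_cm` values below are purely hypothetical, and the function name `paired_power` is our own.

```python
from math import sqrt
from statistics import NormalDist

def paired_power(delta_cm, sd_diff_cm, n, alpha=0.05):
    """Approximate two-sided power for detecting a mean paired difference.

    Normal approximation; sd_diff_cm (SD of the paired differences) is an
    assumed input that the study does not report.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta_cm * sqrt(n) / sd_diff_cm
    # Probability of rejecting in either tail when the true difference is delta_cm.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical SDs, shown only to illustrate how power depends on variability.
for sd in (3.0, 4.0, 5.0):
    print(f"assumed SD {sd} cm -> power {paired_power(2.0, sd, 50):.3f}")
```

Under this approximation, power falls as the assumed variability of the differences rises, which is why the SD behind a reported power figure matters when interpreting it.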

Individual POP-Q values and stages were compared between the estimated and measured techniques for the same examiners and between different examiners. In order to ensure that results were not skewed by examiner recall bias, estimated results obtained prior to measured values were analyzed primarily. Secondary analysis was performed to compare values obtained by estimation following measured ones and to assess the inter-examiner reliability of estimated examinations.

The two-tailed paired t test was used to compare individual POP-Q points, and the chi-square test was used to compare stage. Correlation and agreement of POP-Q points and stage were assessed with Pearson’s correlation coefficient and the kappa statistic, respectively. A p value of ≤ 0.05 was considered statistically significant.
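As an illustration of two of the statistics named above, the sketch below computes a paired t statistic for one POP-Q point and an unweighted Cohen's kappa for stage. The study itself used SPSS; the data here are invented for demonstration, the function names are our own, and the choice of unweighted kappa is an assumption because the paper does not specify the variant.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(measured, estimated):
    """Return (t, degrees of freedom) for the paired differences."""
    diffs = [e - m for m, e in zip(measured, estimated)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two sets of categorical ratings."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement from the marginal frequency of each category.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical values in cm for a single POP-Q point (not study data).
measured_ba = [-1.0, 0.5, 2.0, -2.0, 1.0]
estimated_ba = [-0.5, 0.5, 2.5, -2.5, 1.5]
t_stat, dof = paired_t_statistic(measured_ba, estimated_ba)

# Hypothetical POP-Q stages from the two techniques (not study data).
stage_measured = [1, 2, 2, 3, 3, 2, 1, 3]
stage_estimated = [1, 2, 3, 3, 3, 2, 1, 2]
kappa = cohen_kappa(stage_measured, stage_estimated)
print(t_stat, dof, round(kappa, 3))
```

Converting the t statistic to a p value requires the t distribution, which a statistics package such as SPSS provides; the sketch stops at the statistic to stay within the standard library.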

Results

Patient demographics are displayed in Table 1. Forty-two percent (21/50) had prior hysterectomy. The POP-Q stages were 18% (9/50) stage 1, 38% (19/50) stage 2, 44% (22/50) stage 3, and 0% (0/50) stage 4 based on the measured technique.

Table 1 Patient demographics

Table 2 shows the overall POP-Q stages obtained by different examiners for the estimated and measured techniques. In 90% of subjects, the stage was the same for the measured and estimated techniques. There was no trend toward a higher or lower stage: 4% (2/50) of cases had a higher stage with estimation and 6% (3/50) had a lower stage with estimation. In no subject did the stage differ by more than one. Overall, there was no difference in POP-Q stage between the measured and estimated techniques (p = 0.83).

Table 2 Overall POP-Q stage obtained by estimated and measured techniques by different examiners

When comparing all individual POP-Q values, we found that 81% of values obtained by estimation and measurement were within 1 cm of each other, 2.6% of estimated values were 2 cm greater than measured values, 2.8% of estimated values were 2 cm less than measured values, and 0.004% of estimated values were 3 cm greater than measured values.
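A tally of this kind can be sketched as follows. The bin edges are an assumption (differences within 1 cm, estimated at least 2 cm greater, estimated at least 2 cm less, and everything in between), because the paper does not define them exactly, and the input values are hypothetical rather than study data.

```python
def tally_differences(measured, estimated):
    """Return the fraction of paired values falling in each difference bin.

    Bin edges are assumed for illustration; the paper does not define them.
    """
    counts = {"within_1cm": 0, "over_by_2cm_or_more": 0,
              "under_by_2cm_or_more": 0, "other": 0}
    for m, e in zip(measured, estimated):
        diff = e - m  # positive when estimation exceeds measurement
        if abs(diff) <= 1.0:
            counts["within_1cm"] += 1
        elif diff >= 2.0:
            counts["over_by_2cm_or_more"] += 1
        elif diff <= -2.0:
            counts["under_by_2cm_or_more"] += 1
        else:
            counts["other"] += 1
    total = len(measured)
    return {name: count / total for name, count in counts.items()}

# Hypothetical paired values in cm, recorded in 0.5-cm increments.
measured_cm = [0.0, 1.0, -2.0, 3.0, -1.0, 2.0]
estimated_cm = [0.5, 3.0, -2.0, 1.0, -2.5, 2.0]
fractions = tally_differences(measured_cm, estimated_cm)
print(fractions)
```

Keeping an explicit "other" bin makes it visible when the reported categories do not cover every pair, for example differences of 1.5 cm.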

Because of the concern that measured exams could influence estimated ones, analyses of POP-Q points were also performed separately. For POP-Q exams in which estimated values were obtained prior to measured ones, there was no significant difference between measured and estimated POP-Q points for different examiners (all p > 0.05, Table 3). There was also no significant difference in estimated and measured values among exams in which measured values were obtained first (all p > 0.05).

Table 3 Estimated and measured POP-Q points by different examiners (for estimation points performed before measurement)

Agreement between the techniques was assessed with Pearson’s correlation coefficient (ρ) for interval data and the kappa statistic (k) for categorical data. The value of each coefficient defines the strength of agreement: coefficients between 0.81 and 1.00 are considered almost perfect, between 0.61 and 0.80 substantial, and between 0.41 and 0.60 moderate. In this sample, there was substantial to almost perfect agreement between estimated and measured POP-Q stage and for the majority of individual POP-Q values (Table 3).

For assessment of overall POP-Q stage, good correlation was found between the two techniques with almost perfect intra-examiner (k = 0.84, p < 0.01) and substantial inter-examiner agreement (k = 0.66, p < 0.01). In terms of strength of agreement between the estimated and measured techniques, eight of nine intra-examiner and seven of nine inter-examiner correlations had “substantial or almost perfect” agreement, with the remainder (points D and TVL) having “moderate” agreement.
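The agreement bands used above can be expressed as a small lookup. The function name `agreement_label` is hypothetical, and two details are assumptions: values that fall between the stated bands (for example 0.805) are assigned by strict thresholds, and anything below the moderate band is simply labeled "below moderate" because the text does not name lower categories.

```python
def agreement_label(coefficient):
    """Map a kappa or correlation coefficient to the bands used in the text."""
    if coefficient > 0.80:
        return "almost perfect"
    if coefficient > 0.60:
        return "substantial"
    if coefficient > 0.40:
        return "moderate"
    return "below moderate"  # lower bands are not named in the text

# Kappa values reported in the text for overall POP-Q stage.
for name, k in (("intra-examiner", 0.84), ("inter-examiner", 0.66)):
    print(name, k, agreement_label(k))
```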

Secondary analysis of inter-examiner reliability between estimated exams showed no significant difference in individual POP-Q values and substantial to almost perfect inter-examiner agreement for all points except D and PB (Table 4). POP-Q stage obtained by estimation did not differ significantly between examiners (p = 0.70).

Table 4 Comparison of estimated POP-Q points between examiners

Discussion

The POP-Q examination was developed to provide a precise and efficient model to allow pelvic floor clinicians and researchers to effectively communicate pelvic floor anatomy in a standardized manner. In order to avoid potential imprecision and subjectivity, the anatomic landmarks and quantitative points are explicitly described in the POP-Q system [10]. Though the POP-Q examination was developed with the objective to enhance uniformity and objectivity, modifications to the exam are routinely made in clinical practice.

In the original description of the POP-Q system, there are no specific recommendations as to how quantitative measurements should be made when performing a POP-Q exam. Thus, a variety of methods have been described for the quantitative measurements of POP-Q points including the use of marked measuring sticks, cotton swabs, ring forceps, and wooden or plastic spatulas. Technique modifications may reflect differences in the way the POP-Q is taught or modified in clinical practice to enhance convenience, use available tools and equipment, maximize time efficiency, and reduce repetitive steps during examination.

In this study, we compared POP-Q examination using a rigid marked measuring stick with POP-Q examination by visual estimation. While overall correlation was substantial, apical points (TVL and D) had the poorest correlation between measured and estimated exams. Our results are consistent with those of others who have shown that grading prolapse without measurement is least reliable in the apical segment [11]. Though differences did exist, only a minority of values differed by >2 cm, the difference we and others believe to be clinically important [4, 12, 13].

The Pelvic Floor Disorders Network chose to investigate how a limited number of these technique modifications impact POP-Q measurements. Specifically, they assessed differences in POP-Q values obtained with and without a speculum, differences in perineal measurements at rest and with strain, and whether the leading edge of prolapse differed in lithotomy as opposed to standing [6]. Barber et al. [4] found a 2-cm increase in at least one POP-Q point in almost half of patients when comparing examinations in dorsal lithotomy to upright in a birthing chair. In addition to position, the effect of bladder volume has been examined [12]. This strongly suggests that modifications in exam technique alter exam results, are clinically relevant, and may affect management strategy (surgery vs. pessary) and potentially surgical approach (abdominal vs. vaginal and unaugmented vs. augmented repair).

Limitations of this study include its small sample size, limited patient population, and small number of examiners. Because the exams were performed at the same visit, there is a potential for examiner recall bias. To account for this, estimated (“eyeball”) values obtained immediately after a measured exam were excluded from the primary analysis. It is also plausible that performing sequential POP-Q examinations during the same clinic visit could result in a “teaching effect,” with patients straining more for the second examiner; however, our results did not demonstrate upstaging at the second examination. To better assess intra-examiner reliability, patients could have returned for a second examination by the same examiner 1 or 2 weeks later. However, this would have significantly altered the standard practice in the clinic. In addition, it is a common belief that prolapse can vary in severity from day to day depending on physical activity level, time of day, and other unknown factors; thus test–retest reliability may not have been accurate for exams performed on different days [13]. Another possible limitation is poor generalizability, as our results apply only to physicians experienced in the standard POP-Q exam. Finally, the time required to complete each technique was not recorded in this study. It would have been prudent to time each examination in order to determine whether estimating values does indeed save a significant amount of time during the clinical examination of a patient with genital prolapse.

The POP-Q system is underutilized in clinical practice and in the urogynecologic literature and has been criticized for being “time-consuming” and “confusing” [7, 14]. It has been suggested that, in order to save time, practicing physicians estimate POP-Q values instead of measuring them. The results of this study suggest that, among physicians well versed in the standard POP-Q, estimating POP-Q values provides results comparable to measuring them.