Introduction

Bias in epidemiological studies can lead to incorrect or inconsistent conclusions. A widely accepted bias classification recognizes three main types: selection bias, information bias, and confounding bias [1]. Correction during data analysis is only possible for the last type, and thus, identification and avoidance of selection bias and information bias in scientific research are crucial.

Recall bias is a subtype of information bias which commonly arises in retrospective studies but may also occur in prospective cohort studies and even randomized controlled trials [2]. This type of bias refers to the fact that study participants recall information either inaccurately or incompletely. If distributed unevenly across study groups, it can affect the study’s internal validity [2,3,4,5]. The distribution of recall errors determines the direction of bias [4]. Factors known to affect the reliability of recalled information include the state of memory [6,7,8], duration of the retention interval [9, 10], patient demographics [11], and the occurrence of events [2, 4].

Nowadays, self-reported pain intensity is used as an outcome measure in research more often than before [11]. Given the importance of self-reported pain scores in clinical research, their reliability is critical. Literature on the influence of recall bias on the assessment of pain scores in clinical studies is limited and contradictory [8, 12,13,14,15,16,17,18,19]. Currently, it is unclear whether recall error merely results in inaccurate measurements or may also lead to a significantly altered outcome of retrospective studies on pain.

The main objective of the present study was to determine the influence of recall bias in surgical studies having pain intensity as the primary outcome. We compared retrospectively collected pre-operative pain scores with prospectively collected pain scores to estimate the risk of recall error, recall bias, and erroneous conclusions due to recall bias.

Methods

Setting

The analysis was performed at Máxima Medical Centre (MMC), a teaching hospital in Veldhoven/Eindhoven, The Netherlands. In recent years, a sub-department of general surgery (SolviMáx) has specialized in the treatment of patients with chronic abdominal wall pain and groin pain syndromes. The number of evaluated patients has expanded over the years from around 250 in 2012 to more than 1200 in 2021. The present study did not require permission from a medical ethics committee, since it involved evaluation of previously collected anonymous data.

Study design

Study eligibility criteria

We included data from all retrospective MMC cohort studies reporting results of surgical interventions for abdominal wall pain or groin pain up to 2015. A study was considered eligible when pre-operative self-reported pain scores were collected retrospectively using questionnaires or structured interviews (recalled data). These studies used retrospectively obtained pain scores, because prospective pain scores were often missing from the patients’ electronic hospital files. The original study databases were used for analysis. Studies that only used prospective data were excluded.

Eligibility criteria for participants

A subset of patients of the included studies was used for the current study. Only patients operated for chronic abdominal wall or groin pain in MMC were eligible. For individual patients, both pre-operative and post-operative prospectively registered self-reported pain scores had to be retrievable from routine electronic patient records. We thereby excluded patients whose prospective pain scores were missing. Pre-operative pain scores had to be collected retrospectively after surgery using questionnaires or structured interviews, as part of the study protocol. Furthermore, the prospectively and retrospectively applied pain scales had to be identical. Assembly of the treatment outcome (success or failure, Table 1) had to be reproducible from the documented post-operative and pre-operative pain scores. If patients participated in multiple studies, only data from the first study were used. Losses to follow-up were excluded.

Table 1 Definitions of severe pain as defined by popular pain scores and its relation with outcome following surgical intervention

Data collection process

Retrospectively obtained pre-operative pain scores were extracted from the original study databases and were considered as ‘potentially biased scores’. Pre-operative and post-operative prospectively obtained pain scores were extracted from electronic patient records and were considered as ‘unbiased scores’. Patient characteristics, type of pain treatment, and effectiveness of pain surgery were collected from the original study databases. For the purpose of the present analysis, the original study databases were combined into a new database.

Pain scales

Three different pain scales were used in the selected studies. The numerical rating scale (NRS) instructs individuals to score pain on a 0 (no pain) to 10 (unbearable pain) scale. It is a commonly used one-dimensional pain scale that is easy to use for both clinical and research purposes [20]. The visual analog scale (VAS) uses a horizontal line of 100 mm in length, the left end point (0 mm) representing absence of pain, and the right end point (100 mm) indicating unbearable pain [21, 22]. The patient is asked to place a mark on the line that corresponds to the intensity of the experienced pain. The verbal rating scale (VRS) consists of a five-point (1 to 5) categorical Likert-like scale that uses common words to describe pain (Fig. 1). Patients are asked to choose the words that best describe their pain [23,24,25].

Fig. 1

Various pain scales that were used in the present analysis

Outcome definitions

The primary outcome was the magnitude and direction of recall bias. Recall bias was defined as a systematic difference in overall treatment effect (i.e. study conclusion) between analyses using retrospective data and analyses using prospective data. Definitions of success or failure after a surgical intervention are described in Table 1. A recall error was defined as a discrepancy between prospectively and retrospectively obtained pre-operative pain scores. Recall misclassification was defined as a recall error leading to misclassification of treatment outcome. A negative recall misclassification indicates that treatment success was falsely classified as failure based on retrospective scores, while prospective scores indicated a success. Conversely, a positive recall misclassification indicates that treatment failure was misclassified as treatment success.

Stepwise approach

A stepwise approach was used. First, accuracy of retrospectively obtained self-reported pre-operative pain scores was assessed by comparing these values with prospective pre-operative pain scores (recall error) within studies. Second, individual treatment outcomes were dichotomized as success or failure. Treatment outcome was classified as success or failure using the retrospective or prospective pre-operative scores for all individual patients, as compared to post-operative pain scores (Table 1).

Misclassification of the treatment outcome due to the use of retrospective pain scores in individual patients was identified. The prevalence of recall misclassification within studies was calculated as the proportion of patients with misclassification of the treatment outcome. This was calculated for positive and negative misclassifications, both separately and combined. In addition, the net direction of the misclassification was presented as the difference between the proportions of negative and positive recall misclassification. Third, a meta-analysis was performed to investigate the difference in risk of misclassification between failures and successes. Significant differences point to factors leading to differential misclassification, resulting in an actual recall bias. Finally, differences in recall bias between different pain scales (NRS, VAS, and VRS) were analyzed.
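The classification and counting steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the success rule (`is_success`, a relative pain reduction of at least 50%) is a hypothetical stand-in for the definitions in Table 1, the `Patient` fields and the `summarize` helper are invented names, and the sign convention of the net value (positive minus negative) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    pre_prospective: float  # pre-operative score recorded before surgery
    pre_recalled: float     # pre-operative score recalled after surgery
    post: float             # post-operative score


def is_success(pre: float, post: float, min_reduction: float = 0.5) -> bool:
    """Hypothetical rule (not Table 1): relative reduction >= min_reduction."""
    if pre <= 0:
        return False
    return (pre - post) / pre >= min_reduction


def classify(p: Patient) -> str:
    """Return 'positive', 'negative' or 'none' recall misclassification."""
    by_prospective = is_success(p.pre_prospective, p.post)
    by_recall = is_success(p.pre_recalled, p.post)
    if by_prospective and not by_recall:
        return "negative"  # true success read as failure
    if by_recall and not by_prospective:
        return "positive"  # true failure read as success
    return "none"


def summarize(patients: list) -> dict:
    """Proportions of positive, negative and total misclassification."""
    n = len(patients)
    labels = [classify(p) for p in patients]
    pos = labels.count("positive")
    neg = labels.count("negative")
    return {
        "positive": pos / n,
        "negative": neg / n,
        "total": (pos + neg) / n,
        "net_positive": (pos - neg) / n,  # assumed sign convention
    }


# Invented example scores on a 0-10 scale, for illustration only.
cohort = [Patient(8, 8, 3), Patient(8, 5, 3), Patient(7, 9, 4)]
print(summarize(cohort))
```

Any other success definition can be substituted by replacing `is_success`; the counting logic is unaffected.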

Statistical methods

Data were analyzed using IBM SPSS Statistics 22 software (SPSS Inc., Chicago, Illinois, United States). Mean ± SD or median [interquartile range; IQR] of the prospective and retrospective pre-operative pain scores was calculated per study, as appropriate. Recall errors were assessed by comparing means (or medians) of these pain scores within studies. A paired Student’s t test (normally distributed pain scores) or Wilcoxon signed-rank test (non-normally distributed pain scores) was used to test for statistically significant differences. Statistical significance was accepted at a two-sided p value of ≤ 0.05; a significant difference was taken to confirm the presence of recall errors within a study. In addition, the absolute intraclass correlation coefficient (ICC) was calculated per study using the two-way mixed model.

To assess the influence of treatment effect on the risk of recall bias due to retrospective relative to prospective collection of pain scores, odds ratios (ORs) and corresponding 95% confidence intervals (95% CI) were calculated using Review Manager version 5.3 (The Cochrane Collaboration, London, UK). The number of negative and positive recall misclassifications and the total number of included successes and failures were entered, so that an OR > 1 points to a higher risk of bias in failures than in successes. Hence, an OR > 1 indicates a bias toward a more positive treatment result with recalled pain scores than with prospective pain scores. ORs and 95% CIs were depicted in a forest plot. A subgroup analysis was performed per pain scale. The random-effects model was used for pooling the results. Statistical heterogeneity of studies regarding the recall bias was evaluated by Chi-square test and calculation of the inconsistency (I2).
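For a single study, the OR described above can be computed directly from the four counts. The following Python sketch uses the standard log-scale (Woolf) confidence interval and a 0.5 continuity correction when a cell is empty. Both are assumptions made for illustration and are not necessarily the settings applied in Review Manager, and the sketch does not perform the random-effects pooling. The function name `odds_ratio_ci` and the example counts are hypothetical.

```python
import math


def odds_ratio_ci(events_fail, n_fail, events_succ, n_succ, z=1.96):
    """Odds ratio of misclassification in failures versus successes.

    events_fail: failures misclassified as success (positive misclassification)
    events_succ: successes misclassified as failure (negative misclassification)
    Returns (odds ratio, lower bound, upper bound) of the approximate 95% CI.
    """
    a, b = events_fail, n_fail - events_fail
    c, d = events_succ, n_succ - events_succ
    if min(a, b, c, d) == 0:
        # Assumed handling of empty cells: add 0.5 to every cell.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: 5 of 20 failures and 10 of 80 successes misclassified.
print(odds_ratio_ci(5, 20, 10, 80))
```

An OR above 1 from this function corresponds to relatively more positive than negative recall misclassification, in line with the convention stated above.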

To explore the presence of selection bias in our analysis, we compared baseline characteristics of included and excluded patients. Bivariate and continuous data were tested using the Chi-square test and independent t test, respectively.

Results

Studies and participants

Seven studies on surgical treatment of patients with chronic abdominal wall or groin pain fulfilled the inclusion criteria [23,24,25,26,27,28,29]. Patient recruitment for these studies occurred between August 2000 and December 2015 (Table 2). Four patients who did not undergo an intervention were excluded [26]. Sixty-six overlapping patients were excluded as well. Prospective, retrospective, or post-operative pain scores were missing in 291 patients, leaving 313 patients with complete pain data sets for analysis in the present study (Fig. 2). Median patient follow-up was 21 months [IQR 12–30].

Table 2 Characteristics of reports (n = 7) that were analyzed in the present study
Fig. 2

Flowchart of the selection process. *Data were considered insufficient if the prospectively obtained, retrospectively obtained or post-operative pain score was missing

Main results

Is a recall error present in studies?

Recall errors were present in all seven studies, but statistically significant differences between prospective and retrospective pain scores were found in only four studies (Table 3). Agreement between pre-operative prospective and retrospective pain scores within patients, as expressed by the ICC, was fair in three studies, whereas moderate agreement was found in the four remaining studies.

Table 3 Recall errors and recall misclassification in studies reporting on pain attenuation following (surgical) interventions

What is the actual misclassification of recall and is it differential?

The overall prevalence of recall misclassification was 13.7% [95% CI 10.3–18.0, range 3.3–36.4%] (Table 3). Figure 3 represents the percentage of recall misclassifications per study. The net amount of recall misclassification varied considerably (ranging from 1 to 33%), but in every study it was directed toward more positive misclassification when retrospective pain scores were used.

Fig. 3

Number of patients with positive and negative recall misclassifications by study. *Negative values indicate negative recall misclassification (a shift from success to failure by the recall error). †Positive values indicate positive recall misclassification [shift from failure group (determined by prospective pain score) to the successful group (using retrospective pain score)]

In general, patients failing treatment tended to recall pre-operative NRS and VRS pain scores as significantly higher than they actually were as indicated by the prospectively obtained pain scores (NRS prospective median 7.5 [IQR 6.9–8.0] vs. recall median 8.0 [IQR 7.4–9.0]; VRS prospective median 4.0 [IQR 3.0–4.0] vs. VRS recall median 4.0 [IQR 4.0–4.5]). In contrast, the reverse was found regarding the VAS pain scale (VAS prospective mean 70.7 ± 13.9 vs. VAS recall mean 61.8 ± 12.9).

On the other hand, patients with a successful treatment outcome recorded lower pre-operative pain scores if assessed retrospectively (VAS prospective mean 71.1 ± 15.4 vs. VAS recall mean 55.8 ± 15.4; VRS prospective median 5.0 [IQR 4.0–5.0] vs. VRS recall median 4.0 [IQR 4.0–5.0]), with the exception of the NRS pain scale (NRS prospective median 7.5 [IQR 7.0–8.0] vs. recall median 8.0 [IQR 7.5–9.0]).

What is the actual recall bias in studies?

ORs of recall misclassification in studies are presented in Fig. 4. An OR > 1 indicates more positive recall misclassification (i.e., patients failing treatment if based on the prospective pre-operative pain score but having a successful outcome if based on the retrospective pre-operative pain score). On the contrary, an OR < 1 demonstrates more negative recall misclassification (i.e., patients successfully treated using prospective pre-operative pain score but having an unsuccessful outcome based on retrospective pre-operative pain score). ORs varied considerably among the studies. The overall OR of 2.4 [95% CI 1.1–4.8] was significant, indicating predominance of positive misclassification over negative misclassification among studies. Therefore, an overall actual recall bias was present due to differential misclassification in successes and failures.

Fig. 4

Forest plot of the pooled odds ratios of the recall bias by pain score. Neg misclass negative recall misclassification, indicating a shift from the successful group (based on the prospective pain score) to the failure group (as determined by the retrospective pain score); Pos misclass positive recall misclassification, indicating a shift from the failure group (determined by prospective pain score) to the successful group (using retrospective pain score). Events are the number of misclassified cases if retrospective pre-operative pain scores were used

Heterogeneity as determined by the Chi-square test was absent (p = 0.78). The I2 was 0%, indicating no important inconsistencies between different studies (Fig. 4).

Does recall bias differ per pain score?

The prevalence of recall misclassification differed per type of pain score. Prevalence of recall misclassification was 6.3% [95% CI 3.5–10.8] for NRS, 26.0% [95% CI 17.3–37.2] for VAS, and 24.5% [95% CI 14.5–38.2] for VRS. The OR of the recall misclassification also varied per pain score (Fig. 4), being 2.0 [95% CI 0.6–6.7] for NRS, 3.6 [95% CI 1.1–11.6] for VAS, and 1.6 [95% CI 0.4–6.3] for the VRS pain scale.

Selection bias

Characteristics of the population with incomplete data sets (n = 255) were similar to those of the population with sufficient data (n = 313, Table 4), reducing the likelihood of selection bias in the present study.

Table 4 Baseline characteristics of excluded (missing data sets) and included patients (complete sets of pain scores). Data are presented as means ± standard deviation or ratios

Discussion and conclusions

The present study demonstrates that retrospectively collected pain scores of studies on efficacy of surgery for chronic groin pain result in erroneous measurement of pain intensities. It shows that misclassification due to recall errors affects both patients with successful surgery and patients with unsuccessful surgery, with an overall prevalence of 13.7%. Positive recall misclassification is more likely to occur than negative recall misclassification with an overall pooled OR of 2.4 [95% CI 1.2–4.8]. Patients with an unsuccessful outcome recalled their pre-operative pain scores as being higher than they actually were as indicated by pre-operatively obtained pain scores. Conversely, patients with successful surgery demonstrated lower pain intensities when recalled. It may be concluded that using recalled pain scores has a significant impact on the measurement of surgical outcomes of patients suffering from abdominal wall or groin pain, depending upon the success rate. Hence, recall bias does indeed exist in this patient population.

The present study demonstrated significant recall bias if relying on retrospectively acquired pain scores. A schematic version illustrating how the present results should be interpreted in the context of other studies using recalled pain scores is depicted in Fig. 5. In the first example, a hypothetical retrospective study is performed using recalled pre-intervention pain scores. The hypothetical study included 100 patients and assumes that the intervention is successful in 80 patients (80%). As demonstrated by the present study, recall misclassification affected 13.7% of patients (Table 3). For the purpose of clarity, let us assume that in 15% of the patients, treatment effect is misclassified due to recall. Recall misclassification was about twice as likely in failures as in successfully treated patients (OR 2.35; Fig. 4). As there are fewer failures (n = 20) than successes (n = 80), the absolute number of misclassified patients in the success group is higher. As a consequence, 5 of the 15 hypothetical misclassified patients were actually failures (based on prospective pre-intervention pain scores). The other 10 misclassified patients were actually successes. The net number of misclassifications is 5 patients (10 minus 5). Following this line of thought, the percentage of successfully treated patients decreased from 80 to 75% and the percentage of failures increased from 20 to 25%.
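The arithmetic of this first hypothetical example can be retraced with a short Python sketch. The misclassification risks of 25% in failures (5 of 20) and 12.5% in successes (10 of 80) are taken from the example itself and not from the study data, and the function name `apparent_counts` is hypothetical.

```python
def apparent_counts(n_success, n_failure, risk_in_successes, risk_in_failures):
    """Apply recall misclassification to the true success and failure counts.

    Returns (negative misclassifications, positive misclassifications,
    apparent successes, apparent failures).
    """
    negative = n_success * risk_in_successes  # true successes read as failures
    positive = n_failure * risk_in_failures   # true failures read as successes
    apparent_success = n_success - negative + positive
    apparent_failure = n_failure - positive + negative
    return negative, positive, apparent_success, apparent_failure


# First example: 80 true successes and 20 true failures.
neg, pos, succ, fail = apparent_counts(80, 20, 0.125, 0.25)
print(neg, pos, succ, fail)

# Odds ratio implied by these illustrative counts.
implied_or = (pos / (20 - pos)) / (neg / (80 - neg))
print(round(implied_or, 2))
```

With these inputs the sketch yields 10 negative and 5 positive misclassifications, and 75 apparent successes against 25 apparent failures. The implied OR of about 2.3 is close to the pooled estimate quoted in the text.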

Fig. 5

Recall bias in retrospective studies. Recall errors are more present in the actual failures leading to disproportional numbers of recall misclassifications in failures and successes. Due to the disproportional numbers of misclassifications, recall bias occurs resulting in an overestimation of beneficial effect size

A second example is illustrated in Fig. 5. Since the number of successes is now lower, the absolute number of misclassified patients in the successful group is lower. This leads to a 5% success overestimation in this hypothetical example if retrospective pre-intervention data are used. Using a similar calculation based on true data from the present study indicated that the net recall bias is nullified if the success rate is 67% (Fig. 6). As a consequence, the net direction will go toward an underestimation of the treatment effect in highly successful treatment (i.e., > 67%). Conversely, it will go toward an overestimation of the treatment effect in a less successful treatment (< 67%).
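The dependence of the net bias on the true success rate can be written down explicitly. If r_f and r_s denote the misclassification risks in failures and successes, the net shift in the apparent success rate at a true success rate s is r_f(1 - s) - r_s s, which is zero at s = r_f / (r_f + r_s). The Python sketch below assumes illustrative risks of 25% and 12.5%, taken from the first hypothetical example rather than from the study data. The names `net_bias` and `break_even` are hypothetical.

```python
def net_bias(success_rate, risk_in_successes, risk_in_failures):
    """Net change in apparent success rate (as a fraction of all patients).

    Positive values mean the success rate is overestimated with recalled
    scores, negative values mean it is underestimated.
    """
    gained = risk_in_failures * (1 - success_rate)  # failures read as successes
    lost = risk_in_successes * success_rate         # successes read as failures
    return gained - lost


def break_even(risk_in_successes, risk_in_failures):
    """True success rate at which gains and losses cancel out."""
    return risk_in_failures / (risk_in_failures + risk_in_successes)


R_SUCCESS, R_FAILURE = 0.125, 0.25  # assumed illustrative risks

for rate in (0.4, 0.6, 0.8):
    print(rate, round(net_bias(rate, R_SUCCESS, R_FAILURE), 3))

print(round(break_even(R_SUCCESS, R_FAILURE), 3))
```

Under these assumed risks the break-even point is two thirds, which agrees with the 67% reported above. With other risk estimates the threshold would shift accordingly.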

Fig. 6

Percentage of false estimation, based on the retrospective pain scores in relation to the actual success rate, as based on prospective pain scores

Previous studies on total knee arthroplasty [17], total hip replacement [16], or treatment for lower back pain [19] reported significantly higher pain levels if using recalled data. Additional literature on recall bias also confirms our finding that pain is often remembered as more intense by patients suffering from pain after treatment whereas pain intensity is underestimated after success [12, 30,31,32]. Others suggested that errors in recalling pain intensity are generally non-differential [8, 10, 13, 15, 33].

Most researchers would argue that current state of mood influences pain recollection [7,8,9, 11, 12, 32, 34, 35]. A clear example illustrating this theory is the recall of pain intensity in a postnatal stage [36]. Women who just gave birth underrate previously experienced pain during labor due to an overwhelming feeling of happiness caused by carrying their healthy newborn. In other words, patients become accustomed to improvements in their condition, a term that is referred to as ‘satisfaction treadmill’ [6, 37]. Results of the present study also demonstrate that pain-free patients do (probably unintentionally) underestimate their pre-operative pain, possibly as a result of the positive emotions experienced during recall. A similar theory may, vice versa, hold true for a failure group. Their negative emotions will modulate memory processing and, consequently, recall of pain in the past [35]. It may be concluded that recalled pain intensities are likely congruent with pain, emotions, and interference of pain with daily activities at the time of recall. These phenomena may lead to higher and lower recalled pre-operative pain scores in failures and successes, respectively.

Nowadays, it is recognized that patient outcomes for (chronic pain after) hernia surgery cover more than just unidimensional, simplified pain scores. More dedicated and multidimensional instruments to assess outcomes have been developed over the years, including the Carolinas Comfort Scale [38], the short-form Inguinal Pain Questionnaire [39], and the Activity Assessment Scale [40]. Unfortunately, no consensus has been reached to date on which outcome measurements are preferred [41]. Chronic pain is multifactorial and has been linked to worse mental health; both psychosocial and functional factors are known to have an impact on the pain experienced by patients [42]. The Hospital Anxiety Depression Scale and Pain Catastrophizing Scale may give additional insight into these confounding factors and into how psychological factors influence the results of pain scores. It is likely that the phenomenon of recall bias is also present when more comprehensive scales are completed by patients in a retrospective manner. The extent of bias in studies using these scales may be smaller than with the conventional pain scores used in the present paper, as more specific activities and functions are assessed. Although it is hypothesized that recall bias also occurs when using the more extensive outcome measurements, results of the present paper cannot be extrapolated with certainty. Additional research is desired to assess the bias in other outcome measures following hernia surgery.

Potential study limitations

Our study has several potential limitations. A possible limitation is the fact that the analysis of recall error is likely to be underpowered, since some of the seven studies included fewer than 20 eligible patients. However, the main issue is whether prospectively obtained pain scores differ from retrospectively obtained pain scores. Independently of the statistical analysis used to assess this issue, the prospective and recalled pain scores indicate a significant difference.

Selection bias may have been created as populations with incomplete data were excluded. Since the basic characteristics of included and excluded patients were similar, selection bias is less likely to play a role. Publication bias was avoided, since all eligible MMC studies were included irrespective of publication status.

The present study demonstrates that recall bias varied between pain scales. VAS seemed most susceptible to recall error, but this observation relied on one single study. Therefore, no firm conclusions can be drawn and further research is required.

Conclusions

Surgery outcomes in one in seven patients undergoing remedial surgery are misclassified on the basis of retrospectively obtained pre-operative pain scores (success instead of failure, or vice versa). Misclassification is more likely after unsuccessful surgery than after successful surgery. Therefore, the estimated effect size in studies using recalled pre-operative pain scores depends upon the actual success rate. Success rates exceeding 67% are underestimated, whereas effect sizes are overestimated when success rates are below 67%. Detailed pain scales seem to be more susceptible to recall errors, but this issue needs further investigation.