Introduction

"Yellow flags" are psychological factors and maladaptive beliefs that act as risk factors for persistent pain and prolonged disability in relation to musculoskeletal symptoms [1, 2]. They concern the features that affect how a person manages their situation with regard to their thoughts, feelings and behaviours. The flags are not a diagnosis or a symptom, but an indication that someone may not recover as expected. Some studies have shown that the presence of yellow flags, such as psychological distress/depression [3,4,5], fear-avoidance beliefs [6, 7] and anxiety [8], also increases the likelihood of a poor outcome after spine surgery. For this reason, such risk factors may influence clinicians' perceptions of the suitability of a patient for a surgical intervention [9] or their opinion of the "appropriateness of surgery" in individual cases [10]. However, even if spine surgeons are cognisant of the flag concept and its importance, many have difficulty detecting yellow flags during the consultation [11] and they rarely formally screen for them [9]. This may be the result of the length and scoring complexity of the current instruments, time constraints in routine consultations, or the perception of not being specifically trained to manage psychosocial attributes identified by such tests [12]. While established self-report instruments exist to evaluate most of the yellow flag constructs of interest, lengthy questionnaires are not suitable for use in the routine clinical setting, where the compliance/involvement of all patients is desired and brevity is of the essence. Further, although brief yellow flag screening instruments have been developed for use in primary care [13, 14] or outpatient physiotherapy [15], these may not be appropriate for use in surgical patients, who appear to be a distinct group with respect to their psychological status pre-treatment [16].

The aim of this study was to create a new, brief tool to routinely assess the yellow flag status of patients being considered for spine surgery, and to evaluate its predictive validity in relation to the outcome of surgery.

Methods

The development of the “yellow-flag” tool followed two phases, as summarised below (details regarding the specific questionnaires and the statistical procedures used are given later, in the respective sections).

Phase 1: strategy to select the “yellow-flag” single items

The multidimensional Core Outcome Measures Index (COMI) [17, 18] comprises single items covering the key outcome domains in patients with spinal disorders and has become a useful tool in the routine evaluation of patient outcome. In accordance with the philosophy behind the COMI of keeping responder burden to a minimum, we sought to develop a complementary set of single-item measures with standardised 5-point response options to assess four of the "core" yellow flags (depression, anxiety, catastrophising and fear-avoidance) [3, 7, 8, 13, 19]. Our previous outcome studies in patients with spinal disorders have provided us with many large datasets containing patients' individual item scores for full-length, established questionnaires addressing these four yellow flags. Table 1 gives a description of the patient samples and the references to the original studies from which the data were taken. The data were derived from a total sample of 932 patients (61 ± 16 years; 51% female; 64% surgical) presenting with spinal problems in secondary or tertiary care. Not all patients had completed each questionnaire, depending on the study they were involved in (see Table 1).

Table 1 Data sources for the secondary analyses used to identify core yellow-flag items for the CYFI

We carried out a secondary analysis of these datasets to select the item that in each case best represented the corresponding full questionnaire while also making sense as a stand-alone question for inclusion in a short set of yellow flag questions, to be coined the "Core Yellow Flags Index" (CYFI). Item quality was assessed using the criteria developed by Stanton et al. [20]. Final judgements about the clinical importance of the best single items for the four instruments were made by an expert group comprising spine surgeons, a methodologist and researchers in the field of spine outcome measures.

Phase 2: test of factor structure and prognostic validity of the four yellow-flag items

In a second phase, we tested the factor structure and prognostic validity of the CYFI using new clinical data collected from May 2015 to Apr 2018. A total of N = 3344 patients undergoing surgery of the thoracolumbar spine were asked to complete the CYFI and the COMI, preoperatively, and the COMI at 3- and 12-month follow-up (FU). Questionnaires were completed preoperatively by 2971 (89%) patients, and at 3-month and 12-month FU by 2940 (88%) and 2738 (82%), respectively. A total of 2422 (73%) patients (64.4 ± 15.8 years; 54% female) completed all questionnaires at all three time-points (baseline and both follow-ups). The "Main Pathology" as documented on the Spine Tango surgery form (v.2011; https://www.eurospine.org/forms.htm) was degenerative disease in 1963 (81%) patients, repeat surgery in 194 (8%) and various other pathologies (such as non-degenerative deformity or spondylolisthesis, fracture or trauma, inflammation, infection, tumour, other) in the remaining 265 (11%) patients.

The test–retest reliability of the CYFI was assessed in a subgroup of 56 patients (66.3 ± 13.4 years; 55% female) who completed the questionnaire on two occasions preoperatively, 5 ± 9 days apart.

Questionnaires

The questionnaires used to identify the single-item yellow flags included:

  • the 6-item catastrophising sub-scale of the Coping Strategies Questionnaire (CSQ) [21], or the Pain Catastrophising Scale (PCS) [22, 23]

  • the ZUNG Self-rated Depression questionnaire [24]

  • the Hospital Anxiety and Depression Scale (HADS) Anxiety subscale [25, 26]

  • the physical activity sub-scale of the Fear-Avoidance Beliefs Questionnaire (FABQ) [6, 27], to assess beliefs about activity being a cause of the patient's back trouble and fears about the dangers of such activities when experiencing an episode of low back pain.

The questionnaires used to assess the concurrent validity of the single-item yellow flags included:

  • Visual Analogue Scale (VAS) or graphic/numeric rating scale (GRS/NRS) to measure representative (back or leg) spine-problem-related pain in the last week [28]

  • Roland and Morris questionnaire (RMQ), a 24-item questionnaire that assesses disability due to low back pain in relation to various daily functions/activities [29, 30].

The longitudinal validity of the single-item yellow flags was evaluated in relation to the COMI.

  • The COMI is a 7-item instrument scored 0–10 and comprises questions covering the domains: pain intensity (axial and peripheral, measured separately); function; symptom specific well-being; general quality of life; and social and work disability [17, 31].

All the questionnaires were either originally developed in German or had been adapted and validated for the German language prior to their use in the studies listed in Table 1.

Statistical analyses

Phase 1

Items were favoured for CYFI that: (a) showed a high corrected item–total correlation, i.e. the value of the item corresponded closely to the total scale score without the respective item, indicating the representativeness of the item score for the total scale and its adequacy in representing the construct as a single item; (b) did not display large floor or ceiling effects (i.e. high proportions of scores representing the lowest or highest score possible), that might otherwise indicate a lack of discriminative function, and (c) in Spearman rank correlation analyses, had a meaningful relationship with pain intensity and disability, the clinical outcome measures that have previously been shown to correlate with yellow flag items.
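Criteria (a) and (b) can be illustrated with a minimal sketch. It is not the SPSS procedure used in the study: it assumes a Pearson correlation between the item and the rest-of-scale total, and the response matrix `rows` (one row per patient, one column per item, scored 0-4) is hypothetical data invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(rows, item):
    """Correlate one item with the scale total computed WITHOUT that item."""
    item_scores = [row[item] for row in rows]
    rest_totals = [sum(row) - row[item] for row in rows]
    return pearson(item_scores, rest_totals)


def floor_ceiling(rows, item, lowest, highest):
    """Proportions of respondents at the lowest and highest possible score."""
    scores = [row[item] for row in rows]
    n = len(scores)
    return scores.count(lowest) / n, scores.count(highest) / n


# Hypothetical data: 6 patients x 4 items, each scored 0-4 (not study data).
rows = [
    [0, 1, 0, 1],
    [1, 1, 2, 0],
    [2, 3, 2, 2],
    [3, 2, 3, 4],
    [4, 4, 3, 3],
    [0, 0, 1, 0],
]
r_item0 = corrected_item_total(rows, 0)
floor0, ceiling0 = floor_ceiling(rows, 0, lowest=0, highest=4)
```

Removing the item from the total before correlating avoids inflating the coefficient by correlating the item with itself, which is why the "corrected" variant is preferred for short scales.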

Phase 2

The new sample of data from 2422 surgical patients was analysed using structural equation modelling (SEM). Confirmatory factor analysis was carried out on the preoperative CYFI data to examine whether the single items corresponded to a single yellow-flag factor, i.e. had a one-dimensional factorial structure with high item loadings on a common factor. Cronbach’s alpha was used to assess the internal consistency of the CYFI (≥ 0.70 considered good [32]).
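For readers unfamiliar with the internal-consistency statistic, the following is a minimal sketch of the textbook Cronbach's alpha formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of the total score). It is not the SPSS routine used in the study, and the matrix `rows` is hypothetical illustration data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents x items matrix of numeric scores."""
    k = len(rows[0])  # number of items in the scale
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of the unweighted total score.
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 6 patients x 4 items, each scored 0-4 (not study data).
rows = [
    [0, 1, 0, 1],
    [1, 1, 2, 0],
    [2, 3, 2, 2],
    [3, 2, 3, 4],
    [4, 4, 3, 3],
    [0, 0, 1, 0],
]
alpha_example = cronbach_alpha(rows)
```

Alpha approaches 1 when the items rise and fall together across respondents and approaches 0 when they are unrelated.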

The hypothesis involving longitudinal data (i.e. that the CYFI would add to the prediction of follow-up COMI scores, over and above baseline COMI scores) was tested using SEM by examining the longitudinal directional paths between CYFI at baseline and COMI scores at follow-up, controlling for age and spinal pathology; this was termed the "prospective risk path". We estimated risk paths separately for men and women for two reasons: the prevalence of yellow flags appears to differ between men and women, and an initial model that did not allow their risk paths to differ fitted the empirical data worse than a model that allowed such differences. Path coefficients were considered small (0.10), moderate (0.30) or large (0.50), following the effect size classification of Cohen [33].

The reproducibility of the single yellow-flag item scores was tested using quadratic weighted kappa coefficients, and that of the whole CYFI score using intraclass correlation coefficients (ICC); in each case, values ≥ 0.60 were considered substantial [34].
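The quadratic weighted kappa used for the single items can be sketched as below. This is an illustrative implementation of the standard formula (disagreement weights of (i - j)² / (k - 1)²), not the software routine used in the study; it assumes responses are coded as integers 0 to k - 1, and the test and retest ratings shown are hypothetical.

```python
def quadratic_weighted_kappa(first, second, n_categories):
    """Cohen's kappa with quadratic weights for two sets of ordinal ratings.

    Ratings must be integers coded 0 .. n_categories - 1.
    """
    n = len(first)
    k = n_categories
    # Observed joint proportions of (first rating, second rating).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[a][b] += 1 / n
    row_marg = [sum(observed[i]) for i in range(k)]
    col_marg = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_obs = 0.0
    weighted_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            weighted_obs += w * observed[i][j]
            weighted_exp += w * row_marg[i] * col_marg[j]  # chance agreement
    # Undefined (division by zero) if every rating falls in a single category.
    return 1 - weighted_obs / weighted_exp


# Hypothetical test and retest answers to one 5-point item (coded 0-4).
test_scores = [0, 1, 2, 3, 4, 2, 1, 0]
retest_scores = [0, 1, 2, 4, 4, 2, 2, 0]
kappa_example = quadratic_weighted_kappa(test_scores, retest_scores, 5)
```

Quadratic weights penalise a two-category shift four times as heavily as a one-category shift, which suits graded response options where near-misses are of little clinical concern.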

The analyses were performed using IBM SPSS (IBM SPSS Statistics for Windows, Version 20.0. Armonk, NY: IBM Corp), and AMOS 18.0 software (for the confirmatory factor analysis (CFA) and prospective risk path analyses).

Results

Selection of best yellow-flag single items representing their scales (phase 1)

For PCS, two datasets were available (studies B and D) and for the CSQ catastrophising subscale, one (E) (Table 2). The item "It's terrible and I think it's never going to get any better" (present in both CSQ and PCS) proved to be the best for representing catastrophising. It showed the most consistently high corrected item–total correlations for all studies (0.75, 0.80, and 0.66 for B, D and E, respectively). Compared with most other items of the PCS, floor effects were in the mid-range (33.6%, 39.4% and 33.5%, respectively); there were a few items with lower floor effects, but these were poor in other item characteristics. The chosen item had consistent correlations with pain (0.31, 0.20 and 0.33, respectively) and with the RMQ score (0.52, 0.37 and 0.21, respectively). Finally, the item was verified in the expert group to be one of the best items to represent the pain catastrophising construct as a "stand-alone" item.

Table 2 Results of the statistical analyses to identify the best item representing the domain catastrophising. Item 3, "It's terrible and I think it's never going to get any better" (highlighted in bold), was chosen as the best
Table 3 Results of the statistical analyses to identify the best item representing the domain depression. Item 1, "I feel down-hearted, blue and sad" (highlighted in bold), was chosen as the best
Table 4 Results of the statistical analyses to identify the best item representing the domain anxiety. Item 5, "Worrying thoughts go through my mind" (highlighted in bold), was chosen as the best
Table 5 Results of the statistical analyses to identify the best item representing the domain fear-avoidance beliefs. Item 2, "Physical activity might harm my back." (highlighted in bold), was chosen as the best

The ZUNG Depression scale consists of 20 items. For this construct, data from 3 independent samples were analysed (studies A, B and D) (Table 3). The best stand-alone item for the depression scale was found to be “I feel down-hearted, blue and sad”. The item represents the construct very well (corrected item–total correlations in the three samples were 0.67, 0.69 and 0.66, respectively). Floor effects were large (30.6%, 53.0% and 46.7%), but compared with most other items of the ZUNG they were in the mid-range. Correlations with pain in the last week were relatively low but consistent (0.14, 0.19 and 0.17, respectively), whereas those with Roland–Morris disability scores were moderately high and also consistent (0.30, 0.41 and 0.37, respectively). In addition, the item was verified in the expert group to be the most useful stand-alone item for representing the depression construct. Item 20 also showed good item quality in sample A, though less good in B and D, but we considered it unclear whether “not enjoying the things I used to enjoy” might be reflecting the lack of pleasure due to physical pain rather than the depressed mood.

The anxiety subscale of the Hospital Anxiety and Depression Scale (HADS) consists of 7 items, and data from one study (C) were analysed to identify the best fitting single item (Table 4). The item that performed best was item 5 “Worrying thoughts go through my mind”. The item showed the highest corrected item–total correlation of all items in the scale (0.69), confirming that it represented the total anxiety score very well. Floor effects were large (52.3%), but about in the mid-range of values for all the seven items (32–76%). The correlation between this item and pain in the last week was the second highest of all the seven items (0.19), and its correlation with disability was third highest (0.22, with the highest correlation being 0.30). Item 1 “I feel tense or ‘wound up’” also showed good item quality, but it was felt the colloquialism "being wound up" may have made it unsuitable for use as a stand-alone item, and perhaps caused difficulties with later translations into other languages. Hence, with item 5 (“worrying thoughts…”) having the highest item–total correlation, and wording suitable for a stand-alone item, the experts rated this as the best to represent anxiety.

The physical activity subscale of the Fear-Avoidance Beliefs Questionnaire comprises four items, and data were available from four data-sets (studies A, B, C, and D) (Table 5). The item “Physical activity might harm my back” was chosen as the best. It was not “the best” in any of the criteria, but it was always good and more consistently good across the four samples than were other items (respectively, corrected item–total correlation: 0.75, 0.66, 0.62, 0.61; floor effects: 20.6%, 16.1%, 22.7%, 9.0%; correlation with pain: 0.17, 0.23, 0.29, 0.19; correlation with disability: 0.40, 0.45, 0.45, 0.37). Experts rated the item as the best and most credible as a stand-alone item in representing the FABQ-Activity subscale.

The final wording of the CYFI items in English and other languages (official national languages or native languages commonly spoken by patients attending the authors' Spine Center, for which published versions of the full-length questionnaires were available) is shown in Table 6.

Table 6 German, English, French, Italian, Spanish, Portuguese and Hungarian versions of the CYFI (see footnote for further details)

Test of factor structure and prognostic validity of the four yellow-flag items (phase 2)

Confirmatory factor analysis showed that the 4 yellow flag items represented a common latent construct (CYFI), with age and pathology being controlled for, and with the 4 CYFI-item loadings on the common CYFI factor being constrained to be the same for men and women (RMSEA = 0.05, CFI = 0.96, χ2 (19) = 141.60, χ2/df = 7.45). Cronbach's alpha for the four yellow-flag items was 0.79, showing good internal consistency.
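As a plausibility check on the fit indices, the χ2/df ratio and an approximate RMSEA can be recomputed from the reported χ2 and df. The sketch below assumes the common single-sample RMSEA point estimate, sqrt(max(χ2 - df, 0) / (df × (n - 1))), and n = 2422 (the analysis sample stated in the Methods). AMOS may apply a different computation for multi-group models, so this is an approximation and not the authors' calculation.

```python
from math import sqrt


def chi2_per_df(chi2, df):
    """Normed chi-square: model chi-square divided by its degrees of freedom."""
    return chi2 / df


def rmsea(chi2, df, n):
    """Approximate RMSEA point estimate (single-sample formula, assumed here)."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Values reported for the confirmatory factor analysis; n is an assumption.
cfa_ratio = chi2_per_df(141.60, 19)
cfa_rmsea = rmsea(141.60, 19, 2422)
```

Under these assumptions both quantities round to the reported values (7.45 and 0.05, respectively).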

The test of prognostic validity for the CYFI used a structural equation model in which CYFI predicted COMI at 3-month and 12-month follow-up while controlling for preoperative COMI and pathology (Fig. 1). On a cross-sectional basis, preoperative CYFI and COMI scores were highly correlated (Fig. 1: β = 0.52 for men, β = 0.42 for women; each p < 0.001). CYFI explained a significant proportion of the variance in COMI at 3-month FU (men: β = 0.24, approximately 8% variance explained; women: β = 0.11, approximately 2% variance explained; each p < 0.001; Fig. 1), i.e. CYFI made a small but significant contribution to explaining the treatment effect. The stability between COMI at baseline and COMI at 3-month FU was low, owing to the treatment (β = 0.15 in men, β = 0.20 in women; Fig. 1). The stability between COMI at 3-month FU and COMI at 12-month FU was high (β = 0.61 in men, β = 0.55 in women, p < 0.001; Fig. 1). Nonetheless, CYFI added significantly and independently to the prediction of COMI at 12-month FU (β = 0.14 in men, approx. 4% variance explained, p < 0.001; β = 0.13 in women, approx. 3% variance explained, p < 0.001; Fig. 1) and explained variation in the COMI at 12-month FU that was not explained by individual differences in COMI existing at either baseline or 3-month FU. The fit of the model was good (RMSEA = 0.04, CFI = 0.97, χ2 (39) = 216.92, χ2/df = 5.56).

Fig. 1

Results of the structural equation modelling showing the factor analysis of the CYFI and the correlations between CYFI at baseline and COMI at follow-up (FU), controlling for preoperative COMI, sex and age. The fit of the model was good (RMSEA = 0.04, CFI = 0.97, χ2 = 216.92, df = 39, χ2/df = 5.56). The first coefficient in each pathway is the standardised regression coefficient for men, and the second is that for women. ***All p < 0.001, two-tailed.

Test–retest reliability was 0.60–0.76 (quadratic weighted kappa) for the individual CYFI items and 0.72 (ICC; 95% CI 0.58–0.86) for the CYFI whole score.

Discussion

Our study showed that the newly developed 4-item CYFI constitutes a simple, practicable, reliable and valid tool for routinely assessing key psychological attributes in patients undergoing treatment for spinal disorders in tertiary care. The brevity of the CYFI should make it a useful addition to the brief COMI in the self-assessment of baseline status before surgery. It may be used by clinicians to orientate themselves with regard to the yellow flag status of their patients, and its data may be able to strengthen the existing predictor models of surgical outcome.

A number of brief tools exist to assess yellow flags, but these have focused on patients with chronic low back pain (LBP) in primary care, occupational health or physical therapy settings [13,14,15, 35]. Several factors provided the impetus for us to create a new tool designed to be used with surgical patients. Patients in tertiary care are intrinsically different from those in primary care, in terms of both their symptom severity and degree of psychological disturbance [16]. In creating our own tool, we wished to use, as a basis, questionnaires that had previously been used with patients in secondary and tertiary care study settings. We also wanted to select items from questionnaires that were available in our three national languages (German, French and Italian) as well as English and other languages spoken in our country for which a version of the COMI exists (see Table 6). Further, rather than employing a binary response option (yes/no to whether the statement applies), as used for example in the STarT Back, we wanted to offer a 5-point graded scale that would be consistent with the items in the COMI. Nonetheless, in considering the final items for inclusion in our tool, we attempted to align with the STarT Back, where feasible and supported by the item-quality analyses. The STarT Back items did not all come from the same full-length questionnaires as used in the present study: they were the same for anxiety (i.e. HADS) and catastrophising (i.e. PCS), and the same two items were considered to be most representative of these domains in both studies. The depression item in the STarT Back (“in general, I have not enjoyed all the things I used to enjoy”) came from the Patient Health Questionnaire (PHQ-2) rather than the ZUNG. The ZUNG contains a similar item (item 20) and, although it showed good item quality in our sample A, it was not consistently good for samples B and D (Table 3).
Moreover, when presented as a stand-alone item, we considered that "not enjoying the things I used to enjoy" was too unspecific as a depression item, liable to inadvertently capture the impact of pain on the enjoyment of activities rather than the mental state of being depressed and losing interest (especially in surgical patients with their higher pain levels). The fear item in the STarT Back (“not safe for a person with a condition like mine to be physically active”) originates from the Tampa Kinesiophobia questionnaire and could perhaps be considered a more unwieldy way of saying "Physical activity might harm my back" (our chosen FABQ item), albeit with some ambiguity in the interpretation of the word “safe”. Rasch analyses have previously identified this Tampa Kinesiophobia item as being psychometrically poor [36] and showing differential item functioning with respect to gender [37]. Interestingly, recent qualitative analyses performed by the STarT Back group revealed that the STarT Back depression and fear items were considered “cumbersome” by both patients and general practitioners alike [38]. This substantiates our aforementioned misgivings about these two items. Despite the above differences, test–retest reliabilities were similar for the two tools: the quadratic weighted kappa for the psychosocial subscale of the STarT Back completed by all 53 patients studied was 0.69 (0.51–0.81) and, for 23 of their patients reporting stable symptoms, 0.76 (0.52–0.89) [13]; for the CYFI, the corresponding value was 0.72 (95% CI 0.58–0.86).

Identifying a need to include a yellow flag measure in the baseline assessment of back pain patients, Cedraschi et al. [35] added two yellow flag questions to the COMI, to assess depression and anxiety. The wording was created by those authors, rather than being extracted from established questionnaires, and simply enquired “how much did you feel anxious?” and “how much did you feel depressed?”, with a list of 5–6 thoughts and feelings being provided for each question as examples of what it might mean to feel anxious or depressed. Such "double/multiple-barrelled" (or compound) questions that enquire about many feelings/thoughts within one and the same question can pose difficulties, since respondents wishing to endorse only one of the options might be confused about how to answer [32]. Moreover, the predictive validity of their flag items in relation to outcome was not evaluated. It was suggested that the items be incorporated into the existing COMI to provide a modified-COMI with a psychological dimension, by taking the higher of the two scores (anxiety or depression) and averaging it with the remaining COMI item scores. We see two main problems with this. Firstly, it would cause confusion with respect to the scoring of the COMI as an outcome instrument and would render incomparable the scores from studies with and without the flag questions. Secondly, the psychological items do not constitute key outcomes for many spinal disorders; they may be important predictors or screening items, but they are not “core outcomes” [39], which means inclusion of their scores in the overall COMI score would likely reduce the responsiveness of the instrument (as was seen in [35]). For the CYFI, our recommendation is to view it as an independent tool, calculating an unweighted sum-score for its four items, since in factor analysis all made a reasonable contribution to the latent variable "yellow flags" (Fig. 1).
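The recommended unweighted sum-score could be computed along the following lines. This is a sketch only: the item keys, the function name and, in particular, the 0-4 numeric coding of the 5-point response options are assumptions made for illustration, since the numeric coding is not specified in the text above.

```python
# Hypothetical item keys for the four CYFI domains (names are illustrative).
CYFI_ITEMS = ("catastrophising", "depression", "anxiety", "fear_avoidance")


def cyfi_sum_score(answers, min_code=0, max_code=4):
    """Unweighted sum of the four CYFI item responses.

    `answers` maps each item key to its response code. The default 0-4
    coding is an assumption; adjust min_code/max_code to the coding in use.
    """
    missing = [item for item in CYFI_ITEMS if item not in answers]
    if missing:
        raise ValueError(f"missing CYFI items: {missing}")
    for item in CYFI_ITEMS:
        if not min_code <= answers[item] <= max_code:
            raise ValueError(f"{item} outside {min_code}-{max_code}")
    # Every item carries equal weight, as recommended for the CYFI.
    return sum(answers[item] for item in CYFI_ITEMS)


example_score = cyfi_sum_score(
    {"catastrophising": 1, "depression": 2, "anxiety": 0, "fear_avoidance": 3}
)
```

Refusing to score incomplete questionnaires is a design choice of this sketch; how missing items should be handled in practice is not addressed here.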

We showed that the CYFI made a significant independent contribution to the prediction of COMI scores at 3- and 12-month follow-up. Our findings were hence in keeping with the numerous studies that have shown that higher scores on yellow flag questionnaires generally predispose to poorer outcome [40,41,42]. In the present study, the proportion of variance in outcome accounted for by CYFI (2–8%, depending on gender and follow-up time-point) was greater than that reported for the psychological variables in some previous studies (0–2% [6, 43, 44]) and lower than that reported in others (15–20% [4]). In many studies, only the statistical significance of the effect or the variance accounted for by the whole model was reported, rather than the size of the effect for the psychological variables per se, making it difficult to draw comparisons [45, 46] (and see reviews in [40, 41, 47]). Also, some of the published studies were not truly prospective and most omitted from their models the cross-sectional relationship between psychosocial factors and baseline outcome scores. In the present study, COMI and CYFI were highly correlated at baseline, meaning that the unique contribution of CYFI in predicting COMI at follow-up—beyond that explained by COMI at baseline—was somewhat limited. In our prediction of 12-month COMI, there was, in addition to the direct effect of CYFI, also the indirect effect of CYFI on COMI at 12 months that was mediated by COMI at 3 months. The strong correlation between baseline COMI and CYFI probably indicates that the psychological status of patients at baseline is closely related to their ongoing pain problem and reflects to a lesser extent psychological problems beyond this. In other words, the yellow flags measured in the current sample have a more "situational" origin, driven by current pain and disability, and less of a "stable" trait-like origin reflecting psychological problems unrelated to current pain and disability. 
The situational component of CYFI is probably less powerful in predicting outcome compared with the more stable component. It is also highly likely that in some patients the psychological factors play a major role, whereas in other patients they have no significance. This has been reported in the literature before, where psychological factors appear to have a greater part to play in more "contentious" diagnoses for which the indication for surgery is less certain, compared with those for which the indication is more clear-cut [41]. Further investigations in this area are warranted such that we might direct our future attention to those patients whose outcome is especially influenced by psychological factors. It is difficult to do true experimental studies in this field to prove causality; however, the future collection of CYFI data also at follow-up, in addition to COMI data, and the use of cross-lagged panel correlations, might provide a method for identifying the source, direction and extent of the associations.

The observation that psychological variables significantly influence outcome often provokes the discussion as to whether, having identified that a patient demonstrates significant yellow flags, surgery should still be recommended. We do not believe that the effect size (in the present study, small to moderate; see above) is great enough to promote the CYFI as a tool to be used to deny operative procedures to patients who otherwise have a clear clinical indication for surgery. Indeed, to the authors' knowledge, no such psychological screening tool currently exists, and it is well known that many high-scoring patients still derive great benefit from surgery. Instead, we believe the current findings provide an impetus for administering the CYFI as part of a systematic collection of baseline data, along with numerous other risk factors, such that these can be included in predictive analytical models to improve the accuracy of individual outcome prediction. Many factors ultimately contribute to explaining the variance in individual outcomes; the more variables we are able to identify that make a significant contribution, the more accurate our overall predictions should be. Having a knowledge of the preoperative CYFI score for individual patients may also be useful in daily clinical practice to open a dialogue about these issues with the patient and to better manage their expectations of treatment. This may minimise the subsequent dissatisfaction with outcome that can follow from having overly optimistic expectations [48]. The findings might also be considered as support for more research on the clinical benefit of cognitive behavioural therapy (CBT) accompanying surgery. A number of studies [49,50,51] have shown positive effects, and this is a field of ongoing study, particularly in relation to the selection of appropriate cases.

Our study had a number of limitations. First, the data used in the development of the CYFI were from patients in secondary or tertiary care; the majority, but not all, were surgical patients. Second, the CYFI contains only “negative items” and there are no items enquiring about positive affect, coping strategies or resilience. Although these attributes are often believed to be the "opposite" of the yellow flag attributes, in some studies of spine surgery patients they have been shown to contribute to the prediction of outcome [43]. Third, in the longitudinal study, questionnaires were not completed by all patients at baseline (11% failed to complete one, mostly due to language problems, administrative errors, and emergency admissions) and other patients did not return a questionnaire at 3-month or 12-month follow-up (12–18%). This may have introduced attrition bias in the findings. Fourth, the reason that sex-specific models showed better fit currently eludes us. However, it is important to appreciate that yellow flags do not operate in isolation from other factors [2], and more elaborate models will ultimately be required. Further, such models should be externally validated (i.e. tested for their predictive ability in a separate population of patients), a step that was beyond the scope of the present study. The CYFI items were taken from published versions of the corresponding full-length questionnaires in each language. Nonetheless, confirmation of the adequacy of the different language versions as a group of items and of the corresponding introduction and response options, which have not been formally validated (Table 6), along with further evaluation of the performance of the CYFI (internal and test–retest reliability, construct and longitudinal validity, etc.) in each language, is encouraged. 
And finally, we cannot yet advise on the cut-offs required for indicating that a patient is "yellow flag positive", on a binary basis; we hope to address this in future studies.

In summary, the 4-item CYFI proved to be a simple, practicable, reliable and valid tool for routinely assessing key psychological attributes in spine surgery patients. The CYFI made a statistically significant contribution to the prediction of patient outcome after surgery. In this way, its widespread use may assist in developing better outcome prediction tools, based on the systematic collection of baseline data, e.g. in spine registries. The brevity of the instrument makes it suitable for implementation in everyday clinical practice, as part of the baseline assessment of patients undergoing spine surgery.