Introduction

Neck pain (NP) is a common musculoskeletal disorder causing a limitation in activities ranging from 2 to 11 % in the population suffering from this condition [1]. A meta-analysis on the prognosis of NP reports that pain and disability are still present 1 year after onset [2]. The subsequent health-care utilization and work absenteeism lead to high economic and societal costs [3–5], which may be reduced through a better understanding of how this condition affects quality of life.

The complex etiopathogenesis of NP [1] mandates a biopsychosocial approach such as the one proposed by the International Classification of Functioning, Disability and Health (ICF) [6], which is best investigated through multi-dimensional questionnaires. The cross-cultural adaptation and validation of questionnaires in different languages, besides allowing their use in several countries, may help the development of common strategies to limit the burden of NP across populations.

Among the questionnaires used for measuring outcomes in people with NP, Ferreira et al. [7] found that the Neck Disability Index [8], the Neck Pain and Disability Scale (NPDS) [9], and the Neck Bournemouth Questionnaire (NBQ) [10] fit the ICF categories suitably. The NBQ is a short-form questionnaire consisting of seven items representing important aspects of the biopsychosocial model. These are encoded in the ICF framework as emotional functioning, sensation of pain, housework, managing daily routine, remunerative and non-remunerative employment, community life, and recreation. Each item is scored on a 0–10 numerical scale, where zero represents absence of limitation, for a maximum total of 70 points [10].

As validated versions of the NBQ exist for populations of chronic NP patients in English [10], German [11], and French [12], and for a population of patients with whiplash-associated disorders in Dutch [13], the aim of the present study was to cross-culturally adapt and validate this questionnaire in a sample of Italian chronic NP patients.

Materials

This study was approved by the Ethical Committee of the Azienda Sanitaria Locale 2 Savonese (no. 650-04/07/2013), and permission was obtained from the designer of the original version. The questionnaire was administered at the beginning and at the end of a 4-week physiotherapy treatment program consisting of re-training of breathing (5′), self-mobilization (10′), stretching (10′) and strengthening (10′) of the neck and shoulder girdle muscles, and self-massage in the pain area (5′). The validation of the NBQ-I was structured using the taxonomy and terminology suggested by the COSMIN [14].

Setting and subjects

The participants were recruited by convenience sampling among patients attending the outpatient physiotherapy service of the Santa Corona Hospital and affiliated centers from September 2012 to September 2013. The inclusion criteria were: age >18 years, the ability to read and speak Italian fluently, and chronic non-specific NP (>3 months) assessed by a medical doctor specialized in physical and rehabilitation medicine with 20 years of experience. The exclusion criteria were: specific NP, psychiatric and mental deficits, central or peripheral neurological signs, systemic illness, clinical instability (cardiac, respiratory, vascular), and vertebral surgery. The sample size was estimated on the basis of ten patients per item [15].

Translation and cross-cultural adaptation

This process was performed according to well-accepted guidelines [16]. Two independent translators, both native Italian speakers with an excellent knowledge of English and different cultural profiles (medical and humanities), produced the forward translations of the NBQ and reported any problems that arose during translation. To develop a culturally adapted version, one of the investigators synthesized the two forward translations into a common version on the basis of the reports and personal opinions of the two translators. Afterward, two independent bilingual translators, both native English speakers, back-translated the common version with the aim of reflecting common English wording, and reported any problems that emerged. They were selected because they lived in Italy, were unaware of the concepts being explored, and had no medical background. Then, a bilingual committee composed of two clinicians (physiotherapists) and the translators produced the pre-final version after reviewing all the translations and reports and considering every item in order to reach semantic, idiomatic, experiential and conceptual equivalence. Thirty-two subjects with chronic NP completed the pre-final version and underwent a semi-structured interview about problems in wording and answering. All the findings were re-evaluated by the committee, and no further adjustment was required. Finally, a coordinating committee reviewed the pre-final version and all the documents produced during each step, and the final Italian version of the questionnaire (NBQ-I) was adopted.

Comparator instruments

The outcome measures used to test the construct validity and responsiveness of the NBQ-I were the Italian versions of the Neck Pain and Disability Scale (NPDS-I) [17] and of the European Quality of Life 5 Dimensional scale (EQ5D) [18], the Numerical Rating Scale for pain intensity (NRS-PAIN) [19], and the Global Perceived Effect (GPE).

The NPDS-I has 20 items, each scored between 0 (normal function) and 5 (the worst possible situation), and is divided into three sub-scales: neck dysfunction related to general activities, neck dysfunction related to activities of the cervical spine and neck pain, and cognitive-behavioral aspects. Its total score ranges from 0 (no disability) to 100 (maximum disability) [17]. The EQ5D is divided into five qualitative scales related to five dimensions of quality of life (mobility, self-care, activities of daily living, pain, and anxiety/depression). Each dimension is graded on a three-category ordinal scale where the lowest score represents absence of the problem and the highest an extremely compromised condition. From the instrument’s score, we obtained a numerical index (IND-EQ5D) where 1,000 corresponds to the best perception of quality of life. In addition, a numerical rating scale (0–100 points, where 0 is the worst possible condition) measures the health status (NRS-EQ5D) [18]. The NRS-PAIN ranks the pain intensity on an 11-point rating scale ranging from 0 (no pain at all) to 10 (the worst imaginable pain) [19]. The GPE was administered after the 4-week physiotherapy treatment to measure the effect of the intervention on patients’ perception of their health status. This Likert scale had five response options (+2 = very much improved, +1 = much improved, 0 = no change, −1 = much worse, −2 = very much worse).

Data analysis

The subjects’ clinical and sociodemographic characteristics were described using mean values and standard deviations or counts and percentages. The data analysis was performed using the statistical software R [20]. Specific analyses for each measurement property are described below [14].

Content validity

During the first administration, subjects completed a semi-structured interview designed to test the content validity. The subjects’ judgement was used to investigate the relevance of the items for the study population, thereby providing an assessment of face validity. Comprehensiveness was evaluated by considering the number of missing or multiple responses [21].

Structural validity

An Exploratory Factor Analysis (EFA) was performed to test the dimensionality of the scale, provided that two preconditions were met: Bartlett’s test of sphericity, which assesses whether the correlation matrix of the data is an identity matrix with all items unrelated, had to be significant, and the Kaiser–Meyer–Olkin (KMO) criterion, which tests the sampling adequacy to ensure that the scale items are suitable for factor analysis, had to be ≥0.80. After this assessment, Cattell’s Scree Test was used to determine the number of factors to extract (eigenvalues >1). We used the Maximum Likelihood (ML) method to extract factors and the oblique Promax method to obtain rotated solutions. The communality of each item and its loading on its own factor were reported and discussed after extraction to verify the amount of variance explained by the extracted factors for each original variable.
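For reference, the two preliminary checks have standard textbook forms, sketched below. The notation is introduced here for illustration and is not taken from the source: n is the sample size, p the number of items, R the item correlation matrix with elements r_ij, and q_ij the partial correlation between items i and j given all the other items.

```latex
\chi^2_{\text{Bartlett}} = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert R\rvert,
\qquad \mathrm{df} = \frac{p(p-1)}{2}

\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}
                    {\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} q_{ij}^{2}}
```

With seven items, Bartlett’s statistic would be referred to a chi-square distribution with 21 degrees of freedom. KMO approaches 1 when the partial correlations are small relative to the raw correlations, which is the situation in which a factor model is plausible.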

Subsequently, we performed a Confirmatory Factor Analysis (CFA) using the ML method to test the hypothesized model with the number of factors detected using EFA. The fit indices used to evaluate the model fit were the chi-square (χ²) test, which indicates a good fit when the comparison between the fitted model and the saturated model, which fits the covariances perfectly, is not significant, and the Root Mean Square Error of Approximation (RMSEA), which should be close to zero and in any case ≤0.05 for a good fit. Further, the lower and upper bounds of the RMSEA 90 % confidence interval should be, respectively, lower than 0.05 and higher than 0.10 [22]. Both the Comparative Fit Index (CFI) and the Tucker–Lewis Index (TLI) should be higher than 0.90 or 0.95 for, respectively, acceptable or good model fit [23].

If the fit of the original model was unsatisfactory, the expected parameter changes (EPC) and modification indices (MI) were calculated to identify possible model mis-specifications.

Hypotheses testing for construct validity and responsiveness

The construct validity was assessed by means of hypotheses testing using the correlations between the observed test scores of the questionnaires in both administrations. Responsiveness was assessed by means of the same hypotheses testing process, with the only difference that the correlations were based on the change scores of the questionnaires, calculated as the difference between the test scores obtained in the two administrations. Pearson’s r or Spearman’s ρ was used as the correlation coefficient, depending on the data distribution as checked with the Shapiro–Wilk test.
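The correlation of change scores described above can be sketched as follows. This is only an illustration in plain Python: the function names and the data are hypothetical, the direction of the change score (pre minus post) is an assumption because the text does not specify it, and the normality check that chooses between Pearson and Spearman is left out.

```python
from statistics import mean

def change_scores(pre, post):
    """Change score per subject, assumed here to be pre minus post."""
    return [a - b for a, b in zip(pre, post)]

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho computed as Pearson's r on the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical scores for five subjects (not data from the study).
nbq_change = change_scores([30, 42, 18, 25, 50], [20, 25, 15, 10, 45])
npds_change = change_scores([55, 70, 40, 48, 80], [40, 45, 38, 40, 62])
rho = spearman_rho(nbq_change, npds_change)
print(nbq_change, npds_change, round(rho, 2))
```

For construct validity the same correlation functions would be applied to the observed scores of a single administration rather than to the change scores.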

Our analysis for both the construct validity and the responsiveness concerned primarily the absolute magnitude of the correlations between the entire questionnaires. Our a priori hypotheses were that the correlations of the NBQ-I with the NPDS-I and with the NRS-PAIN would be positive with a magnitude higher than 0.60, while the correlations of the NBQ-I with the IND-EQ5D and with the NRS-EQ5D would be negative and range between −0.30 and −0.60. Further, we tested the relative magnitude of the correlations for the entire questionnaires and their sub-scales. To this end, we generated a list of hypotheses stating which of two correlations was the higher and by how much. The rationale for comparing two correlations between questionnaires or their sub-scales was that the more similar the constructs supposedly being measured, the smaller the difference between the correlations, and vice versa. A difference of 0.1 points was assigned when questionnaires or sub-scales measuring the same construct were present in both terms of the comparison. An example is when sub-scale 1 of the NBQ was compared with sub-scales of the NPDS (e.g., hypotheses 10 and 11, Tables 3, 4). A difference of 0.2 points was assigned when one term of the comparison involved the NRS-PAIN. A difference of 0.3 points was hypothesized when questionnaires or sub-scales measuring different constructs were compared. For example, the correlation between sub-scales of the NBQ-I and the NPDS-I measuring the same construct was considered 0.3 points higher than the correlation between the sub-scales of the NBQ-I and the sub-scales of the IND-EQ5D (e.g., hypotheses 13–16, Table 3), because the latter measures constructs concerning basic aspects of quality of life that were supposed not to be directly affected by the presence or absence of NP (e.g., the three response options for item 1, which measures mobility, are: no difficulty in walking, some difficulty in walking, and bedridden).
The hypotheses testing of the construct validity and of the responsiveness were both rated according to the criteria proposed by de Boer et al. [24], which state that the validity of these psychometric properties is high if less than 25 % of the hypotheses are refuted, moderate if 25–50 % are refuted, and poor if more than 50 % are refuted.

Internal consistency

This property, calculated after the factor analysis, was assessed using Cronbach’s α, and values ranging between 0.70 and 0.90 were judged acceptable [25]. The correlation of each item with its own sub-scale total score (item-test), its correlation with the total score of the remaining items (item-rest) and the inter-item correlations were also reported.
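As a reminder of what the coefficient measures, Cronbach’s α can be computed from the item variances and the variance of the total score. The sketch below uses the standard formula with sample variances; the function name and the small data set are hypothetical and serve only as an illustration.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of items.

    `items` holds one list of scores per item, with subjects in the same
    order in every list. Formula: k/(k-1) * (1 - sum(item var) / total var).
    """
    k = len(items)
    totals = [sum(subject_scores) for subject_scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical scores of four subjects on three items.
example_items = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [1, 3, 3, 5],
]
alpha = cronbach_alpha(example_items)
print(round(alpha, 3))
```

Applying the same function to the items of a single sub-scale gives the sub-scale coefficient.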

Interpretability

The total scores in the pre- and post-treatment conditions and the change scores were summarized using descriptive statistics. Floor and ceiling effects were deemed present when more than 15 % of the patients obtained the lowest or highest possible score [26]. The time needed to complete the questionnaire was recorded.
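The 15 % rule for floor and ceiling effects amounts to a simple proportion check, sketched below. The function name, the returned keys and the example scores are hypothetical.

```python
def floor_ceiling(scores, min_score, max_score, threshold=0.15):
    """Share of subjects at the lowest and highest possible score.

    An effect is flagged when the share exceeds `threshold` (15 % here).
    """
    n = len(scores)
    floor_share = sum(s == min_score for s in scores) / n
    ceiling_share = sum(s == max_score for s in scores) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }

# Hypothetical scores on a 0-20 scale for ten subjects.
result = floor_ceiling([0, 0, 3, 5, 8, 0, 12, 4, 6, 2], min_score=0, max_score=20)
print(result)
```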

The Minimal Clinically Important Difference (MCID) was computed using receiver operating characteristic (ROC) curves [27] by comparing the NBQ change scores with the gold standard represented by the subject’s GPE. The treatment was considered beneficial for GPE responses of +2 and +1. We considered an area under the curve (AUC) of at least 0.70 to be adequate [28].
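An anchor-based ROC analysis of this kind can be sketched as follows. The text does not say how the optimal cut-off was chosen, so maximising Youden’s J (sensitivity + specificity − 1) is an assumption made here for illustration; the function names and the data are likewise hypothetical.

```python
def split_by_outcome(change, improved):
    """Separate change scores of improved and non-improved subjects."""
    pos = [c for c, ok in zip(change, improved) if ok]
    neg = [c for c, ok in zip(change, improved) if not ok]
    return pos, neg

def roc_points(change, improved):
    """(cut-off, sensitivity, specificity) for every candidate cut-off.

    Candidate cut-offs are midpoints between consecutive distinct change
    scores; a subject is classified as improved when change >= cut-off.
    """
    pos, neg = split_by_outcome(change, improved)
    values = sorted(set(change))
    cutoffs = [(a + b) / 2 for a, b in zip(values, values[1:])]
    points = []
    for cut in cutoffs:
        sensitivity = sum(c >= cut for c in pos) / len(pos)
        specificity = sum(c < cut for c in neg) / len(neg)
        points.append((cut, sensitivity, specificity))
    return points

def best_cutoff(points):
    """Cut-off maximising Youden's J (an assumed selection rule)."""
    return max(points, key=lambda p: p[1] + p[2] - 1)

def auc(change, improved):
    """AUC as the probability that an improved subject has a larger change
    score than a non-improved one, counting ties as one half."""
    pos, neg = split_by_outcome(change, improved)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical GPE ratings and change scores for ten subjects.
gpe = [2, 1, 1, 2, 1, 0, 0, -1, 0, 0]
nbq_change = [12, 9, 7, 6, 5, 8, 4, 2, 1, 0]
improved = [g >= 1 for g in gpe]  # GPE of +1 or +2 counts as beneficial
cut, sens, spec = best_cutoff(roc_points(nbq_change, improved))
area = auc(nbq_change, improved)
print(cut, sens, spec, area)
```

Other selection rules, such as the point closest to the top-left corner of the ROC plot, can give a different cut-off on the same data, which is why the rule should be stated when reporting an MCID.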

Results

Subjects

A total of 108 subjects participated in the study, of which 80 were women (74.1 %) and 28 were men (25.9 %). The mean age was 51.5 (SD = 13.6) and the mean duration of symptoms was 12.3 months (SD = 7.5). On average, the sample’s BMI was 24.0 (SD = 3.7). The sociodemographic characteristics are reported in Table 1.

Table 1 Sociodemographic characteristics

At the end of the questionnaire’s administration, a dropout rate of 9.2 % was observed. These subjects abandoned the treatment for personal reasons. The response rates to the GPE were 23.1 % (+2), 45.4 % (+1), 16.7 % (0), and 2.8 % (−1), while 12.0 % did not answer (of which 9.2 % had dropped out).

Translation and cross-cultural adaptations

The cross-cultural adaptation of the NBQ in the Italian population was reached without any problem. All the items were translated linearly and the experts agreed during all phases of the process, even though a consensus about semantic, idiomatic and conceptual equivalence was needed.

The interview about comprehension issues delivered to each participant after the administration of the pre-final version highlighted no significant flaws in the questionnaire. Therefore, the coordinating committee adopted the final version of the NBQ-I without any changes. The 32 subjects were also enrolled in the subsequent administration of the questionnaire for the validation of its psychometric properties.

Content validity

The questionnaire had acceptable face validity because a considerable proportion of subjects judged the questionnaire relevant for their health problem (78.7 %) or for other people with neck pain (87.9 %). Further, the response rates to the question “How much is your health problem represented in the NBQ questionnaire?” on a Likert scale with the answer options “A lot”, “Enough”, “Normal”, “A little” and “Not at all” were 10.1, 46.3, 25.9, 13.9, and 0.0 %, respectively (3.7 % did not answer). The distribution of the total scores showed the absence of missing or multiple responses.

Structural validity

A KMO coefficient of 0.83 and a statistically significant Bartlett’s test of sphericity (p < 0.001) allowed the factor analysis to be performed. Only two factors had an eigenvalue higher than 1; the first explained 56.6 % of the total variance and the second 12.6 %. Table 2 shows the item-factor loadings using oblique Promax rotation and the single item communalities. Items 1, 2, 3, 6, and 7 loaded highly on factor 1 and items 4 and 5 on factor 2. All items except item 7 had a communality higher than 60 %. The two factors showed a correlation of 0.52. The subsequent CFA revealed an optimal fit of the model with the two factors observed in the exploratory analysis. In fact, a non-significant χ² test (χ² = 16.37; p = 0.23), an RMSEA of 0.049 [90 % CI 0.00–0.11; test for RMSEA ≤0.05 (p value) = 0.46], a CFI of 0.993, and a TLI of 0.989 were detected, all indicating a good fit of the hypothesized model.
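The reported RMSEA can be checked against the χ² value with the usual point-estimate formula. The degrees of freedom are not stated in the text, so the sketch below derives them under an assumed model set-up: seven items, two correlated factors, each item loading on one factor only, and no correlated residuals. The (n − 1) convention in the denominator is also an assumption, since some software uses n. The function names are hypothetical.

```python
import math

def cfa_degrees_of_freedom(n_items, n_factors=2):
    """df of a simple-structure CFA with correlated factors (assumed set-up).

    Known moments: n_items * (n_items + 1) / 2 variances and covariances.
    Free parameters: one loading and one residual variance per item, plus
    the correlations among factors (factor variances fixed to 1).
    """
    moments = n_items * (n_items + 1) // 2
    free = 2 * n_items + n_factors * (n_factors - 1) // 2
    return moments - free

def rmsea(chi2, df, n):
    """Point estimate of RMSEA using the (n - 1) convention."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

df = cfa_degrees_of_freedom(7)      # 28 - 15 = 13 under the assumed set-up
estimate = rmsea(16.37, df, 108)    # chi-square and sample size from the text
print(df, round(estimate, 3))
```

Under these assumptions the estimate rounds to 0.049, in agreement with the reported value.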

Table 2 Loadings obtained from the factor analysis

The factors were named pain and functioning (factor 1; items 2, 1, 3, 6, and 7) and anxiety and depression (factor 2; items 5 and 4).
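A minimal sketch of how the total score and the two sub-scale scores could be computed from this structure is given below. The function name, dictionary keys and example responses are hypothetical; the item-to-factor assignment and the 0–10 item range follow the text.

```python
# Item numbers belonging to each factor, as reported in the text.
FACTOR_ITEMS = {
    "pain_functioning": (1, 2, 3, 6, 7),    # factor 1, maximum score 50
    "anxiety_depression": (4, 5),           # factor 2, maximum score 20
}

def score_nbq(responses):
    """Return sub-scale and total scores from a dict {item number: score}.

    All seven items must be answered with a value between 0 and 10.
    """
    if set(responses) != set(range(1, 8)):
        raise ValueError("responses must contain exactly items 1 to 7")
    if any(not 0 <= value <= 10 for value in responses.values()):
        raise ValueError("each item must be scored between 0 and 10")
    scores = {
        name: sum(responses[item] for item in items)
        for name, items in FACTOR_ITEMS.items()
    }
    scores["total"] = sum(responses.values())
    return scores

# Hypothetical responses of one subject.
example = score_nbq({1: 5, 2: 6, 3: 4, 4: 2, 5: 3, 6: 5, 7: 4})
print(example)
```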

Hypotheses testing for construct validity

The results of the hypotheses testing process are reported in the fourth and sixth columns of Table 3 for the conditions before and after treatment, respectively. Before treatment, the expected absolute magnitudes of the correlations were all confirmed except the one with the IND-EQ5D. Of the 41 hypotheses concerning the relative magnitude of the entire questionnaires and their sub-scales, 15 were rejected. In total, the hypotheses testing process rejected 35.5 % of the generated hypotheses.

Table 3 Hypotheses testing—construct validity pre-treatment and post-treatment

After treatment, the expected absolute magnitudes were all confirmed. The hypotheses testing process about the expected directions of the relative magnitude rejected 14 (31.1 %) out of 45 hypotheses.

Considering the percentage of hypotheses refuted in the pre- and post-treatment conditions, the construct validity was moderate.

Hypotheses testing for responsiveness

The results of the hypotheses testing for this property are reported in Table 4. The hypothesized absolute magnitudes were all rejected except the one for the correlation with the change score of the IND-EQ5D. Among the 41 hypotheses on the relative magnitude, 17 were refuted. In total, 44.4 % of the generated hypotheses were rejected and the responsiveness was rated moderate.
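The percentages and ratings reported for construct validity and responsiveness can be re-derived from the counts given in the text. The sketch below assumes four absolute-magnitude hypotheses (one each for the NPDS-I, NRS-PAIN, IND-EQ5D and NRS-EQ5D), which is inferred from the a priori hypotheses and from the total of 45 rather than stated explicitly; the function and variable names are hypothetical.

```python
def rate_property(refuted, total):
    """Rating by share of refuted hypotheses: <25 % high, 25-50 % moderate,
    >50 % poor (criteria attributed in the text to de Boer et al.)."""
    share = refuted / total
    if share < 0.25:
        return "high"
    if share <= 0.50:
        return "moderate"
    return "poor"

ABSOLUTE = 4    # assumed number of absolute-magnitude hypotheses
RELATIVE = 41   # relative-magnitude hypotheses, from the text

# (refuted absolute + refuted relative, total hypotheses)
tallies = {
    "construct validity (pre)": (1 + 15, ABSOLUTE + RELATIVE),
    "construct validity (post)": (0 + 14, ABSOLUTE + RELATIVE),
    "responsiveness": (3 + 17, ABSOLUTE + RELATIVE),
}

summary = {
    name: (round(100 * refuted / total, 1), rate_property(refuted, total))
    for name, (refuted, total) in tallies.items()
}
for name, (percent, rating) in summary.items():
    print(f"{name}: {percent} % refuted, rated {rating}")
```

All three tallies fall in the 25–50 % band. The pre-treatment figure comes out at about 35.6 % when rounded, against the 35.5 % given in the text, presumably a truncation.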

Table 4 Hypotheses testing—responsiveness

Internal consistency

Cronbach’s α was 0.89 (95 % CI 0.84–0.92) for the entire questionnaire. When analysing the two sub-scales, the coefficients were 0.88 (95 % CI 0.83–0.92) and 0.90 (95 % CI 0.86–0.94) for factors 1 and 2, respectively.

The item-test correlation was higher than 0.80 for all items except item 7 (r = 0.69) and, similarly, the item-rest correlations were higher than 0.70 (for item 7: r = 0.53). Most of the inter-item correlations were higher than 0.50. Lower correlations were found between items 4 and 5 and all other items, and between item 7 and all other items, with values ranging from 0.40 to 0.50 except for items 4 and 5 (r = 0.29 and 0.32, respectively).

Interpretability

The median and Interquartile Range (IQR) of the entire questionnaire (maximum score = 70) in the pre- and post-treatment conditions were, respectively, 29 (IQR = 14–37.5) and 15 (IQR = 8–24.2). The change score had a mean of 9.4 and a median of 8, with an IQR ranging from 1 to 21.

Factor 1 (maximum score = 50) had a median of 21 (IQR = 10–28.5) and 12 (IQR = 6–19) in the pre- and post-treatment conditions, respectively. Its change score had a mean of 7.1 and a median of 5 (IQR = 1–14.5).

The median for factor 2 (maximum score = 20) was 6 (IQR = 2–12) in the pre-treatment condition and 4 (IQR = 1–6) in the post-treatment condition. The change score had a mean of 2.4 and a median of 2 (IQR = 0–5). Factor 2 further showed a floor effect after the treatment because 19.4 % of subjects reached the lowest score. The distribution of test and change scores is summarized in Table 5.

Table 5 Description of scores

The time needed to complete the whole questionnaire had a median of 1′ 67″ (IQR = 1′15″–2′25″).

The results of the ROC analysis indicated an MCID of 5.5 points. This cut-off score was the most appropriate to detect a patient-reported improvement with a sensitivity of 0.75, a specificity of 0.60, and an area under the ROC curve of 0.72 (95 % CI 0.61–0.83).

Discussion

The cross-cultural adaptation of the NBQ into Italian revealed no major difficulty during the forward and back translation processes. The review by the expert committee further improved semantic, idiomatic and conceptual equivalence. The test of the pre-final version confirmed that a reasonable translation was reached because comprehension and equivalence issues were avoided. Also, the results on face validity indicated that NP problems are well represented in the questionnaire.

The construct validity was first investigated through the analysis of structural validity. The factor analysis revealed that the NBQ-I is based on a formative model composed of two different unidimensional reflective sub-scales dealing with pain and functioning on the one hand and anxiety and depression on the other. Comparison with other versions was not feasible since this is the first study reporting on this property of the NBQ. Although this questionnaire is claimed to cover several aspects of people’s quality of life because it has items asking about them, the results of the factor analysis revealed a robust two-factor structure. Our recommendation is therefore to rely on the results of the structural analysis of a questionnaire rather than on the content of its single items when considering the number of sub-scales present in a multi-dimensional questionnaire.

The absolute magnitudes of the correlation of the NBQ with the NPDS (pre-treatment = 0.67, post-treatment = 0.70) were in line with the coefficients reported for the German version (pre-treatment = 0.69, post-treatment = 0.80) [11]. However, we further assessed this property with the hypothesis testing process, which allowed us to gain further insights into the construct validity of the questionnaire. Consequently, we speculated on the relative magnitudes and directions that the correlations of the NBQ and its two sub-scales could have with the other instruments and their sub-scales. Surprisingly, there was no difference between the NBQ/NPDS correlation and the NBQ/NRS-PAIN correlation, nor between the NBQ-factor 1/NPDS-factor 1 correlation, i.e., between sub-scales intended to be very similar, and the NBQ-factor 1/NRS-PAIN correlation. These results were confirmed with the post-treatment test scores as well. Therefore, we surmised that the NBQ-I and its factor 1, regarding pain and functioning, are strongly influenced by pain intensity.

The Cronbach’s α of the entire questionnaire (α = 0.89) was slightly below the predefined upper threshold of 0.90, and it was similar to the values obtained in English (α = 0.87) and German (α = 0.79). This indicates a high interrelatedness of the items with a slight tendency to redundancy. The internal consistency calculated for the two sub-scales derived from the factor analysis revealed a similar pattern. While for factor 2 the high redundancy may be attributable to the overlap between feelings like anxiety and depression, the results of the CFA may indicate that item 7 is unnecessary in factor 1. Considering that Cronbach’s α is strongly correlated with the length of the scale [25], the high coefficients obtained for both the full questionnaire and its sub-scales further support the high degree of homogeneity of the NBQ.

The hypotheses-testing method was chosen to evaluate the ability of the NBQ-I to detect change over time. Although the scores obtained in the other languages suggested good responsiveness [10–12], our results indicated it was moderate. Although the threshold used for the comparison between correlations, regarding both absolute and relative magnitude, might have been too high, the rationale for choosing the cut-off value of the difference was conservative and replicated the hypotheses made for construct validity, as in the study of de Boer et al. [24].

The change score (mean = 9.4 points; SD = 13.3) reported in this study is similar to that reported by Gay, Madson, and Cieslak [29] (mean = 14 points; SD = 11.9), who delivered an analogous treatment (e.g., manual therapy techniques three times/week over 1 month plus a home exercise program) in a comparable sample of subjects (e.g., chronic NP); therefore, the NBQ appears to show the same measurement characteristics across these samples. The MCID of 5.5 points found in this study approximates the threshold suggested by Bolton and Humphreys [10]. However, other authors [30, 31] found that a change in raw score of 13 points detected an improvement in the patient’s health status with specificity and sensitivity of 100 %.

Among the neck-specific questionnaires, the Neck Disability Index (NDI) [8] has been recommended in two recent systematic reviews on the measurement properties of original [32] and translated [33] versions of neck-specific health-related quality of life instruments because it was the only instrument with all measurement properties validated and with positive findings. However, the results of the present study add to the evidence on hypotheses testing and responsiveness, strengthen the evidence on internal consistency, and provide novel evidence on the content validity and structural validity of the NBQ, which now has all the measurement properties validated with performance comparable to that of the NDI. Therefore, it is plausible that the NBQ may be reconsidered in future updated reviews on the measurement properties of neck-specific questionnaires as an additional valid instrument to measure the quality of life in people suffering from neck pain.

Even though the validation process followed the principles stated by the COSMIN, a few issues could limit the validity of the Italian version of the NBQ. Among the psychometric properties evaluated, reliability is missing because, as the subjects were attending a physiotherapy service, it was considered unethical to interrupt or delay treatment in order to keep their condition stable enough to test the reliability of the questionnaire. Further, the threshold used to compare the relative magnitudes of correlation among different questionnaires for both the construct validity and the responsiveness may have been too high, and this could have biased the estimates of their validity. Nevertheless, as the hypotheses testing process is a relatively new validation process, we tried to establish a simple rule that could serve as a reference for future validations of different questionnaires. Among the outcome measures chosen in this validation process, we lacked the Neck Disability Index [8], one of the questionnaires routinely used in observational and intervention studies on neck pain, because it had not yet been validated when our study began. Finally, the responsiveness of the NBQ may be considered not fully established because the responsiveness of the outcome measures used to test it had been described on the basis of the significance of their change scores rather than on their validity.

Conclusions

The NBQ has a two-factor structure whose construct validity and responsiveness are moderate. The internal consistency indicated that the items of the NBQ have a high degree of interrelatedness. The change scores obtained in the Italian population are similar to those in different populations, and a clinical improvement is detected by the NBQ when the change score is greater than 5.5 points.