Background

Cost-utility analysis is an increasingly important aspect of health technology assessment, where cost per quality-adjusted life year (QALY) gained is used as a primary endpoint for a novel intervention [1]. The QALY is an estimate of health that represents both survival and health-related quality of life (HRQoL) as a single number [2], where the HRQoL is measured by using a health state utility value, which could be captured by questionnaires. A commonly used preference-based instrument, such as the EuroQol five-dimension (EQ-5D) and the six-dimension health state short form (derived from the short form 36 health survey [SF-36] and the short form 12 health survey [SF-12]) (SF-6D), is used to generate the health-state utility scores [3, 4]. By using health-state utility values, health service researchers can estimate and compare QALYs across different interventions, which can provide decision-making information for healthcare resource allocation [5]. The absence of utility values derived from generic preference-based instruments, however, is a barrier to populating economic models with the best evidence of effectiveness. Statistical regression-based mapping is one method that is gaining popularity when no preference-based instruments are available in the study [6].

The measurement of HRQoL in the pain setting is usually carried out using pain-specific questionnaires rather than generic preference-based instruments because the questionnaires pay attention to relevant health problems and tend to gain more clinically meaningful changes [7, 8]. The most commonly used measurement for evaluating disability in patients with neck pain is the neck disability index (NDI), which comprises a 10-item self-administered questionnaire that was modeled on the Oswestry Back Disability Index by Vernon and Mior in 1991 [9]. Preference-based measures of health status, however, are not always performed.

Neck pain is a common problem with an annual incidence rate ranging from 0.055 cases per 1000 persons (disc herniation with radiculopathy) to 213 cases per 1000 persons (self-reported neck pain) [10, 11]. Nearly two-thirds of adults will experience neck pain during their lives Neck pain is more prevalent among women and middle-aged adults. Some occupations, such as office work, are also risk factors [12]. Due to the necessity of health economic analysis in the area of neck pain, deriving utility values from pain-specific questionnaires is necessary.

Recent studies have shown that there is a strong relationship between the SF-6D and the total NDI scores [13, 14]. However, analysis of mapping items data of NDI to SF-6D in patients with chronic neck pain (CNP) is limited. The aim of the present study was to establish a mapping relationship that uses HRQoL data from the NDI to estimate SF-6D utility values in patients with CNP in order to facilitate economic evaluation of novel therapies for CNP and imply future intervention study of improving utilities by targeting some items.

Methods

Data source

Patients with chronic non-specific neck pain were enrolled from the Pain Management Department of Huadong Hospital. The study was done in accordance with the ethical standards of the Declaration of Helsinki, the International Conference on Harmonisation’s good clinical practice, local laws, and applicable regulatory requirements. The study was approved by the Ethics Committee of Huadong Hospital in Shanghai, China. Written informed consent was obtained from all patients before enrollment.

During the first treatment visit after study enrollment, data was collected on demographics and clinical features, as well as across a range of patient-reported outcomes from patient interviews and from two self-administered HRQoL questionnaires: NDI and SF-6D. They also answered questions on socio-demographic information (sex, age, education, housing type and income), self-rated health status (“Compared with past 12 months, what do you think about your present health condition? Better, The same, Worse”) and life style information (e.g. smoking). Eligible patients were aged ≥ 18 years, had neck pain during at least the 3 months prior to the study and had no exercise-related risks. To reduce the potential factor that might interact with neck pain [15], patients were ineligible if they had one or more of the following conditions: existing vestibular pathology; receiving medical intervention in the last 3 months; cervical fracture or dislocation; systemic diseases; neurological, cardiovascular or respiratory disorders affecting physical performance; history of traumatic head injury; inability to provide informed consent; pregnancy.

Measures

  1. 1.

    Neck-specific disability

    • A validated Chinese version of the original 10-item Neck Disability Index (NDI) was employed in this study [16]. The NDI, which is derived from the Oswestry Index and designed for assessing neck pain and disability [17], contains 10 self-reported items: pain intensity, personal care, lifting, reading, headache, concentration, work, driving, sleeping, and recreation [18]. Each item is scored on a 6-point scale from 0 (no disability) to 5 (full disability); the sum score out of all 10 items is calculated using a percentage of the maximal score, with higher values representing greater disability.

  2. 2.

    SF-36/SF-6D

    • The SF-36 is a widely used instrument for measuring general health status, and comprises36 self-report questions regarding functional health and well-being [19]. SF-36v2™ was used in this study, and the SF-6D was derived from11 items identified from the SF-36, which comprises six multi-level dimensions of health (physical health problems, bodily pain, general health, vitality, social functioning, role limitations due to emotional problems, and mental health), with each dimension having four to six levels [20]. This classification of health status yields 18,000 possible health state scores, which can range from 0 to 1.0, where 0 indicates the worst health state and 1.0 the best health state. The utility score can be used in the health economic evaluation of interventions and for population health surveys [21, 22]. The Hong Kong Chinese version and HK scoring algorithm of SF-6D was not adopted because it was not widely validated as United Kingdom version and also used less health states in developing the algorithm [23].

Statistical methods

The SF-6D index scores were regressed onto the individual item scores of NDI by using the ordinary least squares (OLS) regression with the suggestion that more complex models may not add predictive power or reduce errors in prediction [24]. The scores for individual and all items in the NDI in turn were offered to the model to give a general algorithm for predicting SF-6D scores:

$$ \mathrm{S}\mathrm{F}\hbox{-} 6\mathrm{D}\kern0.5em \mathrm{score}=\alpha +\Sigma \beta \times \left( Scale\kern0.5em score\right) $$
(1)

All variables were treated as continuous variables. Backward stepwise selection with a significance level of 0.1 from the full model was used to identify statistically significant variables.

When data have pronounced ceiling effects, which is a common phenomenon observed in both health profile and preference-based measures, the use of OLS regression violates the statistical requirement for linearity of conditional expectation, leading to inaccurate predictions of preference-based scores and inaccurate identification of predictor variables. To address this ceiling effects and reduce the potential estimation errors of OLS method, generalized linear modeling (GLM) [25], Censored Least Absolute Deviations (CLAD) and the Tobit model were used to provide an extensional analysis [26].

Because the main purpose of a mapping study is to derive an algorithm that accurately derives health state utility values from other data sets, mean absolute error (MAE) and root-mean-squared error (RMSE) were employed to assess the goodness of fit of the models according to external guidance [6]. RMSE was normalized to the range of score forSF-6D and expressed as % RMSE [27]. The MAE is the mean of absolute differences between the observed SF-6D utility score and the SF-6D utility score predicted from the model; the RMSE is the positive square root of the mean squared estimation error between the observed and predicted values. Smaller MAE and RMSE values suggest better model performance. Fit was decided based on lowest RMSE. The best-fitting model(s) were then re-estimated using data for the validation cohort. External validation would be conducted based on the following equations between total NDI score and SF-6D utilities [13, 14].

$$ \mathrm{S}\mathrm{F}\hbox{-} 6\mathrm{D}\kern0.5em \mathrm{score}=\hbox{-} 0.0115* Total\kern0.5em scores\kern0.5em of\kern0.5em NDI+0.8383 $$
(2)
$$ \mathrm{S}\mathrm{F}\hbox{-} 6\mathrm{D}\kern0.5em \mathrm{score}=\hbox{-} 0.0135* Total\kern0.5em scores\kern0.5em of\kern0.5em NDI+0.8686 $$
(3)
$$ \mathrm{S}\mathrm{F}\hbox{-} 6\mathrm{D}\kern0.5em \mathrm{score}=\hbox{-} 0.0115* Total\kern0.5em scores\kern0.5em of\kern0.5em NDI+0.8391 $$
(4)
$$ \mathrm{S}\mathrm{F}\hbox{-} 6\mathrm{D}\kern0.5em \mathrm{score}=\hbox{-} 0.01005* Total\kern0.5em scores\kern0.5em of\kern0.5em NDI+0.78219 $$
(5)

Due the absence of external data sets, external validation samples were created by randomly selecting half the derivation and validation sample [27]. Statistical analyses were carried out using the statistical programming environment R (R Development Core Team, 2014). P < 0.05 was considered statistically significant.

Results

Patient characteristics, HRQoL scores, and utility values

The study enrolled 272 patients (180 randomly assigned into the derivation set and 92 in the validation set). Baseline patient characteristics are listed in Table 1. Just over one half (57.8 %) of the respondents in the whole set were female; the average (± standard deviation [SD]) age at enrollment was 39.9 ± 12.3 years; time from diagnosis was 27.4 ± 7.37 months; nearly 45 % were college-educated. There were no significant differences in the derivation and validation sets across all variables.

Table 1 Descriptive statistics for the whole, derivation and validation sets

Table 2 shows the summary characteristics of the outcome measures across the whole, estimation and validation sets. In the whole set, the average utility value of SF-6D was 0.519 ± 0.092 and the inter quartile range (IQR) was 0.454 to 0.563. The minimum observed utility value was 0.19 and the maximum was 0.82. The mean NDI Global score was 28.46 ± 12.93 and the IQR was 18 to 40. The mean pain intensity score was near 3.5, which was higher than other item scores. Personal care item had the lowest average scores (2.14). The total scores of the ten NDI items were similar between derivation and validation sets.

Table 2 Summary characteristics of outcome measures across the samples

Mapping NDI to SF-6D

Results of the OLS regression analysis are summarized in Table 3. In the trimmed model (i.e., the backward elimination model), recreation item showed statistical significance. The pain intensity and sleeping item, which had a p-value of 0.0146 and 0.0064 in the full model, became statistically insignificant in the trimmed model, in which it had a p-value of 0.2 and 0.45, respectively. Recreation Recreation was the most influential item in both models (Table 3). Results of the GLM, Tobit and CLAD method regression analysis are presented in Appendix Tables 5-7.

Table 3 Simple correlations and multiple regression multiple regression analyses mapping NDI data to EQ-5D

Table 4 shows the model performance for both the derivation and validation sets when the trimmed model was fitted. The MAE values for the derivation and validation sets were 0.0404 ± 0.033and 0.0404 ± 0.030030, respectively. The normalized RMSE were 0.06 and 0.0505, respectively. The MAEs and RMSEs estimated by the GLM, Tobit and CLAD method was 0.06 and 0.07, 0.06 and 0.07 and 0.05 and 0.06, respectively, which offered no advantage in terms of estimation errors over the OLS models.

Table 4 Performance in trimmed model according to EQ-5D quartile in the derivation and validation sets

We not only examined accuracy of the model but also the distributions of the predicted scores. The observed mean value of the SF-6D score was similar to the predicted SF-6Dindexes of both sets (Table 4). Both sets reported lower variability across predicted utility values with similar SDs at 67 % of the magnitude of the observed SF-6D scores. In both data sets, the 25th percentile and median predicted values were overestimated, but 75th percentile predicted values were underestimated. A plot of observed SF-6D index versus predicted utility scores from the trimmed model indicates that the model fits the data well in both the derivation and validation sets (Fig. 1); the Pearson correlation coefficients were 0.70and 0.7575, respectively.

Fig. 1
figure 1

Scatter plot of predicted values based on the trimmed model parameters versus the observed SF-6D values. A perfect fit is indicated by the 45° reference dotted-line. The blue line is the shown as the liner fit line

External validation

The MAEs estimated by equation 2, 3, 4 and 5 were 0.11, 0.13, 0.11 and 0.10, and RMSE were 0.14, 0.17, 0.14 and 0.13, respectively. When the algorithm mapped our total scores to SF-6D utilities was used (Table 3), the MAE and RMSE were 0.07 and 0.08, respectively.

Discussion

This study explored an algorithm for linking the NDI to the SF-6D using regression equations to predict SF-6D utility values in a data set of patients with CNP. We found that in settings where the NDI, but not SF-6D, is collected, estimating the SF-6D from the NDI score using an algorithm appears useful. The trimmed model with recreation item as explanatory variables has relatively lower MAE and RMSE in comparison with the full model. We noticed that the mean age of our CNP patients were not old (Table 1), recreation might have a considerable impact on their health quality of life because it has an indispensable role in their life [28]. Other mapping studies also found that recreation could notably affect the utility value [29, 30]. This finding indicates that any treatment of improving the recreation might be appreciated in patients with CNP. To the best of our knowledge, the current analysis is the first study to develop the mapping algorithm from NDI items to the SF-6D in patients with CNP; this gives our predictive model the advantage of being applicable to Chinese patients with CNP. Because of the significant loss of data when derived method was used, health preference should be measured directly. However, mapping from condition-specific measure was also reasonable because the mapped score has been pone to have greater extent of sensitivity and responsiveness in patients with different patient populations [31].

Comparing predictions of the preference-based scores using NDI items suggested that a fairly good performance was found for the SF-6D in both the derivation (MAE = 0.0404, RMSE = 0.0606) and validation sets (MAE = 0.0404, RMSE = 0.0505), although the adjusted R2 was 0.49and 0.5656, respectively. The performance of validation sets was relatively better than derivation sets, which might be caused by the stronger convergence of former comparing with latter. Brazier JE and colleagues found that the fit of an algorithm mapping a condition-specific instrument to SF-6D is variable because the values of R2 ranged from 0.17 to 0.51 in the other studies. Because the purpose of a mapping algorithm is to predict health state utility values from other datasets, however, the predictive accuracy of the model should be examined [32]. R2 and adjusted R2 used as measures of explanatory power do not have enough information for testing model performance of a mapping algorithm; they also do not indicate if the mapping algorithm is suitable for the entire dataset [6]. Consequently, the quality of performance of the models used in this mapping analysis is likely to be in line with the typical model performance found in other mapping studies.

Similar to other studies [3335], our finding indicates that the mapping equations overestimated the utilities for the severe health states, whereas they underestimated those for the mild health states (Fig. 1). The final mapping model resulted in the best predicted performance with utility scores ranging from 0.45 to 0.56 (Table 2). It is unknown why this is the case, but it lies in other factors which were sensitive to a patient’s quality of life that may be captured in health-preference instruments but may not be captured in NDI. Because almost half of the enrolled patients fell into this category when they were first diagnosed with CNP, it was expected that the model would perform well in predicting the utilities of these patients. Nevertheless, because of the low MAE and RMSE values, the final model could be recommended as an acceptable method for estimating utilities from the NDI responses for use in a cost-utility study.

The previous two studies translated NDI total scores into SF-6D indexes by using linear regression modeling [13, 14]. Our cohort had similar overall mean SF-6D indexes and NDI total scores in comparison with the sample reported by Richardson SS and colleagues (SF-6D at 0.49 and NDI total scores at 28.54), which was notably different from the data published by Carreon LY and colleagues (SF-6D at 0.67 and NDI total scores at 14.55), possibly due to the study setting and population. Their results showed that the correlations between NDI and SF-6D utility scores were strong and statistically significant, which provides some face validity to the relationships with the SF-6D observed in our study. However, external validation found that the MAEs and RMSEs estimated by using their equation were higher than ours. One of the potential reason is the different characteristics of the patient cohorts, including the ethnic and diseases Another important reason might be the different predictors because we found that model using the total score as the predictor had a relatively high MAE and RMSE in comparison with model using the individual items.

No known studies have compared and mapped the individual items of NDI and SF-6D in patients with CNP, but a recent study by Carreon LY and colleagues compared the NDI with the EQ-5D-3 L in patients with neck and/or upper-extremity complaints [36]. They found no significant relationship between the EQ-5D-3 L and the NDI to allow for a valid estimation of EQ-5D-3 L indexes from the NDI using regression modeling, although the model using the individual NDI items had an R2 of 0.46 and an RMSE of 0.172. The authors think the reason is that the different descriptive components of the EQ-5D-3 L and the NDI are used for measuring very different constructs. Each dimension of EQ-5D-5 L has 5 levels instead of 3 [37], which might establish a stronger and more robust relationship with the NDI, allowing for prediction of the EQ-5D indexes from the NDI.

There are some weaknesses in this study. First, the final model equation might not generalize as well to the NDI scores of the patients with neck disease other than CNP because the external validation found the performance of the equations derived from the different disease cohort was different. How well the model will generalize to other morbidities related to neck pain should be considered, and validation is necessary. Second, for the purpose of the present analysis, we currently do not have enough information regarding other health outcomes or comorbidities to analyze in the models, such as the severity of the CNP due to the small sample size. Using these data as covariates would improve the performance of the models. Future work can include more comprehensive demographic data from patient records. Third, due to the cross-sectional nature of the dataset used in the present study, it is unclear whether the predictive model reported in this study will change over time. No other independent dataset with observations on both SF-6D and NDI could be used to evaluate the external validity of the mapping algorithms reported by this study. There’s also the point that since the Chinese specific scoring algorithm for SF-6D is not available yet, the original UK scoring algorithm has been employed in the study, which is the most widely used methods. The differences of the background characteristics in the two populations might have some influences on the results. Finally, Over-/under-prediction on the bottom/top SF-6D utilities might limits the wide usage of the algorithm, which might be more suitable in CNP patients with moderate pain. Hence, future analyses are necessary to test the accuracy of our methods by using data obtained elsewhere [4].

Conclusions

In conclusion, the statistical performance of the final models demonstrated that it is possible to estimate health state utility values for SF-6D from NDI. Due to the limitations in the current analysis, further research might be required to update the model and examine the performance in other ethnic populations.