Background

For over two decades, patient-reported outcome surveys have been developed and licensed worldwide for use in population-based quality of life outcomes research [1]. In the United States, the Centers for Medicare & Medicaid Services (CMS) has fielded the Medicare Health Outcomes Survey (MHOS) to track the functional health and well-being (often referred to as health status or health-related quality of life [HRQoL]) of Medicare beneficiaries enrolled in health plans sponsored by Medicare Advantage organizations (MAOs) [2]. The MHOS has been administered annually since 1998 to a random sample of Medicare beneficiaries, with a follow-up survey sent 2 years later to those who remain in the plan. There are three eras in the types of instruments used to assess HRQoL. For cohorts 1–6 (baseline surveys 1998–2003), the Short Form Health Survey-36® version 1.0 (SF-36®) [3] was used for both baseline and follow-up surveys; for cohorts 7 and 8 (baseline surveys 2004–2005), the SF-36 was used at baseline and the Veterans RAND 12-Item Health Survey (VR-12) [4] at follow-up; and from cohort 9 to the present (baseline surveys 2006–present), the VR-12 has been used at both baseline and follow-up [5]. CMS currently uses the MHOS to assess differences in patient-reported outcomes among MAOs, and such differences comprise one component of the Medicare Star Rating system [6].

The MHOS design poses challenges to researchers and policy makers interested in using these outcome measures across cohorts that span the different eras. Such analyses could involve combining observations from different cohorts to improve statistical power or assessing trends over time. These challenges are especially salient in oncology. For the purpose of improving outcome surveillance, the National Cancer Institute launched a new data linkage in 2010 that brings together MHOS data with patient-level cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) Program [7,8,9,10].

The SF-36 includes eight scales: physical functioning (PF), vitality (VT), bodily pain (BP), general health perceptions (GH), role limitations due to physical health (RP), role limitations due to emotional problems (RE), social functioning (SF), and mental health (MH) [11]. The VR-12, a derivative of the Veterans RAND 36/SF-36 questionnaires, contains 12 representative items from these eight scales, with modifications to the four items representing “role functioning” (two related to physical health and two related to emotional problems, with yes/no response choices in the SF-36 and five-point Likert response choices in the VR-12) [12]. Scales of the SF-36 comprise 2–10 items, compared with 1 or 2 items for the VR-12 scales. In addition, both instruments have physical and mental health composite summary scores (PCS and MCS) that use all eight scales in their computation [13].

The usual methods of calculating VR-12 scale scores (i.e., by linearly transforming the item response choices to a 0–100 scale) do not precisely align with the corresponding SF-36 scale scores, for several reasons. One is that while corresponding SF-36 and VR-12 scales have 1 or 2 questions in common, the additional questions that contribute to the SF-36 scales but not to the VR-12 scales may be perceived and answered somewhat differently than the shared questions [14]. Matching of the VR-12 items and scales to those of the SF-36 is also affected by the context (i.e., instrument length, item positioning) in which the questions appear [15]. In a 12-item format, response patterns can shift relative to responses to the same question embedded in a 36-question format. For example, our preliminary results found that respondents to the VR-12 reported having somewhat more energy on the “energy” item than similar respondents to the SF-36, even though the item itself is unchanged. Another factor that complicates matching the VR-12 scale scores to those of the SF-36 is the previously noted difference in response choices for the role items [5, 12].

Additionally, the MHOS has other facets that complicate analyses designed to match VR-12 items and scales to SF-36 scales. These issues relate to changes in the cohorts over time between the SF-36 era and VR-12 era surveys. Such “sample evolution problems” include changes in the types of MAOs in the survey; year-to-year differences in administration (e.g., proxy responses, foreign-language forms, and the use of telephone administration); and changes in population average disease severity.

Our main goal for this paper was to rescore and thus bridge the VR-12 items and scales to match SF-36 scales. The rescored scales need to be applicable across a broad array of potential MHOS subsamples that could differ by their expected mean values of health status and scale scores. We also developed a rescoring algorithm for SF-36 items so that each item could best represent the full scale. This study was motivated by the needs of researchers using SEER-MHOS data to assess scale scores across cohorts that span the SF-36 and VR-12 eras. It was also designed to have general applicability for other types of studies outside of the oncology area that involve MHOS SF-36 and VR-12 data.

Methods

Data source

We included respondents aged 65 or older in MHOS cohorts 1–12 (1998–2011) who returned mailed, self-report surveys completed in English.

Overview of analytic approach

Our overarching goal was to create algorithms for matching the VR-12 items and scales to SF-36 scales that could be used, without further customization, across a broad array of possible MHOS subsamples. Such subsamples could differ, for example, by respondent characteristics such as age, gender, and the number and type of medical conditions, which in turn would be reflected in differences in the expected mean values of health status and scale scores. An alternative to such a unified algorithm would be an approach in which the rescoring of VR-12 items depended on the characteristics of each subsample; however, that would require recalibration for every new study and would make it difficult to compare HRQoL scores among different studies.

One of the challenges in creating a unified rescoring algorithm is that a particular SF-36 item, scored in a “simple” manner, does not precisely match the score of the corresponding SF-36 scale. In this paper, we use the term “simple” scoring to refer to the mostly linear scaling method described by Ware for the SF-36 [16] or to the purely linear 0–100 scaling for the VR-12. Additionally, the difference in score between the item and the scale (in both magnitude and sign) varies by age group [14]. Similar issues would arise for other respondent characteristics, so a unified matching algorithm has to optimize the match between items and scales across a range of respondent characteristics. The same problem that arises between embedded SF-36 items and their associated scales also arises when matching the VR-12 items to the SF-36 scales.
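
To make the “simple” scoring concrete, the linear transformation can be sketched in a few lines of Python. This is a generic illustration of equal-interval 0–100 scaling, not the official scoring code, and it ignores the reverse scoring applied to negatively worded items.

```python
def simple_score(response: int, n_choices: int) -> float:
    """Linearly map a 1-based response choice onto a 0-100 scale.

    With five choices, responses 1..5 map to 0, 25, 50, 75, 100;
    with three choices, responses 1..3 map to 0, 50, 100.
    """
    return 100.0 * (response - 1) / (n_choices - 1)

# Example: a 3-choice physical functioning item ("limited a lot",
# "limited a little", "not limited at all"):
print([simple_score(r, 3) for r in (1, 2, 3)])  # [0.0, 50.0, 100.0]
```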

Without knowing how the items will actually be used in the context of the SEER-MHOS linked data, the property we most desire for item scoring is an estimate with the best possible external validity. This implies that when the rescored item is averaged across any particular MHOS subsample, even one that differs substantially in characteristics and health status from the overall MHOS mean, it should, to the greatest extent possible, produce the same mean score as its associated full SF-36 scale.

The overall task of developing the scoring algorithm was divided into two main sections: (1) Matching scores of individual “embedded” SF-36 items to their corresponding scales, and (2) Matching VR-12 items and scales to the corresponding SF-36 scales (see Table 1). Our reasons for pursuing the first of these tasks were twofold. First, we wanted to determine the feasibility of our approach (that would also be used in a modified form for the second task) using a sample in which we did not have to be concerned about “sample evolution,” as previously described. Second, SF-36 surveys not infrequently have one or more items with missing responses. A typical approach to calculating scales when items are missing from that scale is to average the remaining items scored mostly on a 0–100 scale [17], as long as at least half of the items have been scored. By rescoring individual items of each scale so that they best represent the scale calculated from all items, the results of the first task could permit a more accurate calculation of scale scores when items were missing and would even allow for calculation of the scale if only one item from a scale was present. This would allow us to use survey responses from a high percentage of respondents for matching the VR-12 scales to those of the SF-36.
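
As an illustration of this conventional half-scale rule, the following minimal Python sketch (not the MHOS production code) scores a scale from whatever items are present:

```python
import numpy as np

def scale_score(item_scores):
    """Score a multi-item scale (items already on 0-100) by averaging
    the non-missing items, provided at least half were answered;
    otherwise return missing (the conventional half-scale rule)."""
    vals = np.asarray(item_scores, dtype=float)
    answered = ~np.isnan(vals)
    if answered.sum() * 2 < vals.size:
        return np.nan  # fewer than half the items answered
    return vals[answered].mean()

# A 10-item PF scale with three missing responses is still scored:
pf_items = [100, 50, np.nan, 100, 50, np.nan, 0, 50, np.nan, 100]
print(round(scale_score(pf_items), 1))  # 64.3 (mean of the 7 answered items)
```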

Table 1 Crosswalk of individual items in common between and unique to the MOS SF-36 and the VR-12

Task 1: matching embedded SF-36 items to their associated SF-36 scales

For matching embedded SF-36 items to their associated SF-36 scales, we used the baseline surveys from cohorts 1–8 to define MHOS subsamples and the follow-up surveys from cohorts 1–6 in cases where responses for both the item and the corresponding SF-36 scale were complete for all subsample members.

The MHOS contains a variety of potential subsamples that vary by health status. If the subsamples were defined on the basis of the scale and the item themselves, the results would be distorted, but we can find “independently defined” subsamples using other correlated variables. For the MHOS, these include age, gender, and responses to MHOS questions dealing with chronic conditions, physical symptoms (such as arthritis pain, shortness of breath, and chest pain), and mental symptoms. Using these variables to define a distance, we used a k-means clustering algorithm to solve for M = 200 “clusters” (subsamples), plus one cluster for respondents with missing data on one or more of these variables [18]. Defining subsamples from clusters of responses, rather than from the scale or item directly, helps guard against errors in the independent variables.
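
A sketch of this clustering step in Python, assuming the covariates (age, gender, and condition and symptom indicators) have already been assembled into a numeric matrix; the names, preprocessing, and distance specification here are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def assign_clusters(X, n_clusters=200, seed=0):
    """Assign complete cases to one of 200 k-means clusters and route
    cases with any missing covariate to an extra (201st) cluster."""
    complete = ~np.isnan(X).any(axis=1)
    # standardize so each covariate contributes comparably to the distance
    Z = StandardScaler().fit_transform(X[complete])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(Z)
    labels = np.full(X.shape[0], n_clusters)  # index 200 = missing-data cluster
    labels[complete] = km.labels_
    return labels
```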

Within each of the 201 cluster-defined subsamples, the difference between the item score and the scale score was used as a measure of how well the item and scale agree. If one averages the squared differences, weighted by subsample size, and takes the square root of that quantity, the result is a measure of criterion validity, which we have termed cluster-weighted “extensibility.” An extensibility value can be used in a way similar to a standard error of the mean [19]. This statistic is central to our approach (in both this task and the second task of matching VR-12 items to SF-36 scales) because we have no way to predict future uses of the MHOS data or what subsamples might be chosen. We need a robust scoring methodology that can be applied to many circumstances with credible accuracy, along with a way to characterize the limits of that accuracy. The premise of this methodology is that we do not seek to estimate the scale value of a single MHOS respondent from one item, but rather to estimate, for any subsample we might encounter, the mean scale value from the mean value of a single item.

To evaluate and compare extensibility among rescoring methods, we took the following approach. Within each cluster and across all members of the cluster, we computed the average value of the scale based on the simple scoring method applied to all items from the scale, and the average value of the “estimated” scale based on a chosen scoring method applied to a single survey item from that scale. For a given cluster m, we obtain the average of the scale (\(\overline{X}_m\)) across cluster members scored in the simple manner [20].

Summary of notations

\(\overline{X}_m\)—average of the SF-36 version 1 scale, scored in the simple manner, within cluster m

\(\overline{x}_m\)—average of the new rescored value of a single item within cluster m

\(x_n\)—new rescored value of the item for an individual n within a cluster

Throughout, we consider one item at a time within its associated scale, with the goal of judging how well a rescored value of that item matches the value of the scale. Applying the new scoring method to a single item, we obtain an average estimate (\(\overline{x}_m\)) across the same cluster members. The squared difference between the two, \((\overline{X}_m - \overline{x}_m)^2\), describes how far apart they are. The mean squared difference across all clusters describes how well the estimated scale values, based on the new scoring method applied to a single scale item, succeeded at matching the values based on the simple method derived from all scale items, and the square root of that quantity is what we have termed “extensibility.” A low value is desirable; that is, if the scale scores based on the new scoring method and the simple method always agree, then the new method is extensible. The root-mean-squared difference reflects both statistical and non-statistical errors. Non-statistical errors come from a biased estimate, whereas statistical errors are due to random sampling variation. For rescoring methods that produce individual estimates, the statistical errors are described by the standard error, as calculated using standard formulas for the variance of a mean.

Our goal was to rescore the individual items so that they optimally represent the scale within the cluster. Cluster scoring methods assume that each response to each item within cluster m for individual n will take on some value \(x_n\). For example, in the simple scoring scheme of an item with five response choices, \(x_n\) can take the values 0, 25, 50, 75, and 100. In the process of matching items to scales, the values of these response choices will be rescored. Within each cluster m, our choice of how to rescore each response for each item determines \(\overline{x}_m\), the mean of the scored item within that cluster. If cluster m has \(N_m\) respondents and \(x_n\) is the rescored value of an item for an individual n within the cluster:

$$\overline{x}_m = \frac{1}{N_m}\sum_{n=1}^{N_m} x_n.$$

This will be compared to \(\overline{X}_m\), the mean score, based on the simple scoring method, of the SF-36 scale that encompasses the item used to calculate \(\overline{x}_m\) under the new scoring method. The goal of the scoring method is to find optimal values for each response of each item so that the subsample size-weighted mean of \((\overline{X}_m - \overline{x}_m)^2\) for each item across all clusters is minimized. If N is the total number of respondents across all of the clusters (201 clusters in this case), \(N_m\) is the number of respondents in cluster m, and we weight \((\overline{X}_m - \overline{x}_m)^2\) relative to the mean cluster size N/201, then for each item we seek to minimize the quantity

$$S = \frac{1}{201}\sum_{m=1}^{201} \frac{N_m}{N/201}\left(\overline{X}_m - \overline{x}_m\right)^2 = \sum_{m=1}^{201} \frac{N_m}{N}\left(\overline{X}_m - \overline{x}_m\right)^2.$$
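
In code, S and its square root can be computed directly from the per-cluster means and sizes. This is a direct transcription of the formula above (a minimal sketch; as noted below, the reported values additionally subtract the expected squared error due to random sampling).

```python
import numpy as np

def weighted_objective(scale_means, item_means, cluster_sizes):
    """Return S, the cluster-size-weighted mean squared difference between
    the simple-scored scale means and the rescored item means, and its
    square root."""
    Xbar = np.asarray(scale_means, dtype=float)   # \bar{X}_m per cluster
    xbar = np.asarray(item_means, dtype=float)    # \bar{x}_m per cluster
    Nm = np.asarray(cluster_sizes, dtype=float)
    S = np.sum((Nm / Nm.sum()) * (Xbar - xbar) ** 2)
    return S, np.sqrt(S)
```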

The square root of S is what we term cluster-weighted extensibility. To minimize these values, we used a non-linear least squares algorithm (“nl” in Stata [21]), which minimizes the sum of squared differences of a non-linear expression using a pseudo-regression technique fitted with a modified Newton–Raphson algorithm [22]. The item values were parameterized in a way that forced the solution to be monotonic; for example, for the items in the physical functioning scale, the score for “limited a lot” has to be less than the score for “limited a little,” in order to avoid unreasonable solutions corresponding to local minima of the objective function (the extensibility statistic). We also explored multiple starting values to assure ourselves that we had found the best local minimum of the extensibility. When extensibility values are reported, the expected squared error due to random sampling is subtracted from S. The cluster scoring is available upon request from the senior author.
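
Outside of Stata, the monotonicity-constrained fit could be approximated as below: the response-choice values are parameterized as a lowest value plus cumulative positive increments, so the fitted scores cannot cross. This sketch mirrors the spirit of the “nl” fit under stated assumptions; it is not the authors' code.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_monotone_scores(resp_counts, scale_means, cluster_sizes):
    """Rescore one item's response choices to minimize the cluster-size-
    weighted squared differences between simple-scored scale means and
    rescored item means, constraining the choice values to be monotone.

    resp_counts: (n_clusters, n_choices) counts of each response choice.
    """
    P = np.asarray(resp_counts, dtype=float)
    P = P / P.sum(axis=1, keepdims=True)      # response-choice proportions
    w = np.sqrt(np.asarray(cluster_sizes, float) / np.sum(cluster_sizes))
    n_choices = P.shape[1]

    def values(theta):
        # theta[0] = score of the lowest choice; exp(theta[1:]) = strictly
        # positive increments, so the implied choice values never cross
        return theta[0] + np.concatenate(([0.0], np.cumsum(np.exp(theta[1:]))))

    def residuals(theta):
        item_means = P @ values(theta)        # \bar{x}_m for each cluster
        return w * (np.asarray(scale_means, float) - item_means)

    # start from the simple equal-interval scoring (0, 25, ..., 100)
    theta0 = np.concatenate(([0.0],
                             np.log(np.full(n_choices - 1, 100.0 / (n_choices - 1)))))
    return values(least_squares(residuals, theta0).x)
```

The sum of the squared residuals above equals S, so minimizing it minimizes the cluster-weighted extensibility.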

Task 2: matching VR-12 items to the corresponding SF-36 scales

For the second task, matching VR-12 items to the SF-36 scales required a somewhat different approach from that used to match embedded SF-36 items to their corresponding scales.

To define clusters for this particular analysis, our intent was to use variables that would likely have a similar meaning in cohorts 7 and 8 relative to cohorts 1–6, and on that basis, we chose age, gender, and baseline SF-36 scores.

We used the rescored SF-36 items from the first task to calculate SF-36 scales for the baseline and follow-up surveys for cohorts 1–6 and the baseline survey for cohorts 7–8.

SF-36 scales and corresponding VR-12 items were considered one at a time. Using Classification and Regression Trees (CART), we modeled follow-up scores in cohorts 1–6 for each scale based on the baseline SF-36 scores, age, and gender. From this we obtained 100 clusters within which the predicted SF-36 scale score was similar. We then averaged the follow-up SF-36 score for each scale within each cluster using data that spanned cohorts 1–6.
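
A Python sketch of this step, with scikit-learn's DecisionTreeRegressor (constrained to 100 leaves) standing in for the CART software actually used; input names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def build_clusters(X_baseline, y_followup, n_leaves=100, seed=0):
    """Grow a regression tree whose 100 leaves define clusters of
    respondents with similar predicted follow-up scale scores, and record
    the mean observed follow-up score within each leaf."""
    y = np.asarray(y_followup, dtype=float)
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves,
                                 random_state=seed).fit(X_baseline, y)
    leaf = tree.apply(X_baseline)  # leaf (cluster) id for each respondent
    expected = {l: y[leaf == l].mean() for l in np.unique(leaf)}
    return tree, expected

# Applying the cohort 1-6 tree to cohort 7-8 baseline data then assigns each
# cohort 7-8 respondent the expected follow-up score of its leaf:
#   leaf_78 = tree.apply(X_baseline_78)
#   expected_78 = np.array([expected[l] for l in leaf_78])
```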

We applied the cluster definitions obtained in cohorts 1–6 to individuals in cohorts 7 and 8 who had both baseline and follow-up surveys. In these cohorts, the SF-36 was used at baseline and the VR-12 at follow-up. The mean expected SF-36 follow-up score for each scale within each cluster in cohorts 7–8 is the one calculated from the cohort 1–6 follow-up SF-36 scores.

From the above analyses, we have 100 clusters of respondents to the follow-up surveys in cohorts 7 and 8 in which we have an expected score for each SF-36 scale. We also know the observed VR-12 item responses to the follow-up survey in cohorts 7–8 for each individual within each cluster. The overall strategy was to rescore the response choices for each VR-12 item such that, for each of the 100 clusters, the VR-12 item score averaged across respondents within a given cluster best matches the expected corresponding SF-36 scale score.

In order to find the best match between VR-12 items and SF-36 scales, we again sought to minimize the cluster size-weighted mean squared difference between item and scale, as described in Task 1, using the previously described non-linear least squares routine “nl” in Stata.

Evaluation of the new scoring algorithm

Our methodology was based on three kinds of calibrations/predictions, or “steps.” Step 1 was part of Task 1, and Steps 2 and 3 were part of Task 2.

Step 1: SF-36 items were scored/weighted to match SF-36 scales. This matching was done cross-sectionally using cohort 1–8 baseline data and cohort 1–6 follow-up data

To evaluate this calculation, we compared the extensibility obtained using the simple (equal-interval) method of scoring with that obtained after rescoring the SF-36 items to minimize extensibility across clusters.

Step 2: using CART, baseline SF-36 scores, age, and gender in cohorts 1–6 were used to predict follow-up SF-36 scale scores. The prediction algorithms were then applied to develop predicted follow-up SF-36 scale scores in cohorts 7–8

We evaluated the accuracy of our algorithm in predicting follow-up scale scores in a different set of cohorts from the one used to develop the algorithm. In our main analyses, we used data from cohorts 1–6 to predict follow-up scores in cohorts 7–8. To validate this methodology, we used a similar approach in which baseline and follow-up data in cohorts 1–3 were used to predict follow-up scale scores for cohorts 4–6. Specifically, CART analyses in cohorts 1–3 were used to model follow-up scale scores based on baseline SF-36 scores, age, and gender, from which 100 clusters were created. The same cluster specifications from cohorts 1–3 were then applied to the baseline SF-36 data from cohorts 4–6 in order to create predicted SF-36 follow-up scale scores for cohorts 4–6 for each cluster and each scale. Since the actual SF-36 follow-up scale scores for cohorts 4–6 are known, they can be compared with the predicted scores.

Step 3: follow-up VR-12 items are scored/weighted to match the predicted follow-up SF-36 scale scores (from Step 2) in cohorts 7–8

We examined how the new scoring algorithm compared with the “simple” scoring method in matching VR-12 items to the corresponding SF-36 scales, using extensibility as the measure of criterion validity. Most of the items in the VR-12 are derived from the SF-36, and some are unique to the VR-12 (Table 1). In scoring SF-36 scales in the simple manner, only those cases in which there were responses to all items within a scale were used. Four of the VR-12 scales comprise single items; we also matched the two-item VR-12 scales to those of the SF-36 by averaging the two items comprising those scales. Since the rescoring algorithm might be applied to data subsets quite different from the complete dataset of MHOS respondents, we used a new set of clusters to display comparisons between simply scaled and rescaled items, and we did not weight the extensibility calculation by cluster size.

We also evaluated Step 3 to assess whether the extensibility results are sensitive to the sample used to obtain predicted follow-up scores for cohorts 7 and 8 and to the sample to which the rescoring algorithm is applied. For this analysis, we divided the sample comprising cohorts 1–8 into two parts: a larger part consisting of two-thirds of the data (the “estimation sample”) and a smaller part with the remaining one-third (the “test sample”). Using the methods described in Steps 2 and 3 (Task 2) of our main analyses, but applied only to the estimation sample, clusters were created in cohorts 1–6, and predicted follow-up SF-36 scale scores were calculated for cohorts 7–8 based on follow-up results in cohorts 1–6. In the estimation sample, VR-12 items for follow-up cohorts 7–8 were rescored to best match the predicted SF-36 scale scores. In the test sample, the same methodology was used to create new clusters and new predicted SF-36 scale scores for cohorts 7–8. We then assessed how well the VR-12 rescoring algorithm developed in the estimation sample worked in the test sample in matching the VR-12 item scores to the expected SF-36 scale scores. In particular, we examined the correspondence of extensibility between the estimation and test samples with regard to the matching of VR-12 items to SF-36 scales.
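
Schematically, and reusing the hypothetical helper functions sketched earlier (names are illustrative), the split-sample evaluation proceeds as follows:

```python
import numpy as np

def split_sample(n_respondents, frac_estimation=2 / 3, seed=0):
    """Randomly split respondent indices into estimation (two-thirds)
    and test (one-third) samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_respondents)
    cut = int(frac_estimation * n_respondents)
    return idx[:cut], idx[cut:]

# Workflow:
#   est, test = split_sample(n_respondents)
#   1. run build_clusters(...) and fit_monotone_scores(...) on the
#      estimation sample only, freezing the rescored response values;
#   2. rebuild clusters and expected SF-36 scale scores in the test sample;
#   3. compare extensibility (via weighted_objective) between the
#      estimation and test samples.
```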

Results

Profile of the MHOS cohorts

Table 2 is the profile of the analytic sample used in the analyses. There were 570,459 individuals in the SF-36 era (cohorts 1–6), 119,543 individuals in the transition era (cohorts 7–8), and 452,155 individuals in the VR-12 era (cohorts 9–12).

Table 2 Descriptive profile of Medicare Health Outcomes Survey (MHOS) cohorts

Rescoring of SF-36 items and extensibility of rescored SF-36 embedded items

Table 3 gives the SF-36 item scorings for the questions related to “moderate activities” and “climbing several flights of stairs” in the PF scale, based on our analytic sample of elderly MHOS respondents who returned English-language, mailed self-report surveys with all 10 PF items answered, using: (1) the simple prorated scale values (0, 50, and 100), and (2) the cluster scoring method. In addition, Table A-1 of Supplemental Material 1 contains the rescored values of each SF-36 item that can be substituted for its full-scale counterpart when scored in the most extensible manner, using the cluster method.

Table 3 Score equivalents of “moderate activity” and “climbing several flights of stairs” items (from the SF-36) for two scoring methods

Rescoring the VR-12 items and extensibility of the rescored VR-12 items

Table 4 contains the unweighted extensibility of both rescored and simply scored VR-12 items and scales, based on 100 clusters of roughly equal sample sizes. (A brief description of each item abbreviation is contained in Table A-2 of Supplemental Material 1, along with the rescored item responses for each of the 12 items.) For example, the rescored VR-12 PF2 item (moderate activity) differs slightly from the SF-36 PF scale, with an extensibility (root-mean-square difference) of 2.1 points. For comparison, Table 4 also includes the unweighted extensibility calculated with the simple scoring of the VR-12, excluding the “role” items and scales. In all cases where comparisons are meaningful, the extensibility of the rescored items and scales was better (i.e., lower) than with simple scoring. The “role” items were excluded because it is not meaningful to compare such items as simply scored, given the difference in response choices between the SF-36 and VR-12 (yes/no vs. a 5-point Likert scale). Supplemental Material 2 contains additional detail on how extensibility can be used and its applications for the researcher.

Table 4 Extensibilitya (unweighted) of VR-12 items and scales

Figure 1a–h further illustrates the effect of rescoring on the association between SF-36 and VR-12 scales. For each of the 100 clusters derived from CART, the x-axis represents the “expected” follow-up SF-36 scale scores for cohorts 7 and 8 (for patients with complete survey data), based on the results from cohorts 1–6. The y-axis represents the observed follow-up VR-12 scale scores for cohorts 7 and 8 (for patients with complete and incomplete survey data), scored in both the simple and rescored manner. As previously noted, our methodology sought to rescore the VR-12 such that the observed follow-up VR-12 scores in cohorts 7 and 8 optimally matched the expected SF-36 scores across all clusters. If the match were ideal, all cluster points would fall along the line of identity. The data points representing the rescored VR-12 items fall relatively close to the line of identity, or “ideal fit.” For the PF, BP, GH, VT, SF, and MH scales, data points representing the simple scoring method for the VR-12 are also displayed. Compared with the points representing rescored values, the simply scored data points deviate from the line of ideal fit to a greater degree. This is reflected in the differences in extensibility shown in Table 4.

Fig. 1

a–h For the 100 clusters applied to baseline cohorts 7 and 8, correlation between the estimated scores of the follow-up SF-36 scale and the scale scores for the VR-12. For the PF, BP, GH, VT, SF, and MH scales, the VR-12 scores are calculated by two different methods, simple and rescored. For the RP and RE scales, only the rescored VR-12 scales are displayed, for the reasons described in the “Results” section text for Table 4. (Color figure online)

Evaluation of the accuracy of predicted scale scores

As noted in the Methods, we also assessed the accuracy of the algorithm used in predicting follow-up scale scores in cohorts 7–8 by using data from cohorts 1–3 to predict follow-up scale scores in cohorts 4–6, where the actual scale scores were known. The results are presented in Figures A1a–A1h of Supplemental Material 3. The mean predicted SF-36 scores match the mean actual SF-36 scale scores quite well across the clusters for six of the eight SF-36 scales, with somewhat less tight matching for the two role scales. The differences from the line of identity are consistent with the extensibilities observed from the matching of VR-12 items and scales to SF-36 scales. The less tight results for the role scales, while still following the line y = x, suggest that the yes/no responses are more difficult to model accurately, which helps to explain in part why it is hard to match the responses of the VR-12 to these role scales.

Extensibility of estimation and test samples

In addition to the above results based on the full sample, we also calculated extensibility for the matching between VR-12 items and SF-36 scales based on a split estimation and test sample. Table 5 shows the correspondence of extensibility between the estimation and test samples, which are within about 0.6 of one another. When the extensibility is large, it may mean that the item content shifted slightly in the VR-12. This would be true for the role items, which changed from yes/no choices in the SF-36 to “all of the time” through “none of the time” response choices in a 5-point Likert format in the VR-12. The general health item's extensibility probably reflects the inability of the single item to represent the rest of the general health scale well; the SF-36 version has five items, and the other four have a different response scale than the global item in the VR-12. Other items had acceptably low extensibility. These findings were generally similar to those from the original full MHOS sample (Table 4).

Table 5 Correspondence of extensibility values between estimation and test samples

Items and scales across MHOS cohorts using the simple versus new scoring algorithm

Figure 2 compares the simple and rescored items and scales across the 12 MHOS cohorts, using PF as an example (figures for the other scales are available on request). The values in Fig. 2a, b were not adjusted for potential differences among cohorts in the characteristics of respondents and MAOs or for other aspects of sample evolution. Figure 2a represents all those in the analytic sample who responded to the baseline surveys, and Fig. 2b represents those in the analytic sample who responded to the 2-year follow-up surveys. The rescoring was applied to all items in both the SF-36 and VR-12 surveys. The lines representing the rescored items (PF2 and PF4) and the rescored PF scale track closely with one another. Although they differ slightly due to differences in the method of scoring, the rescored version of the PF scale (red line) also tracks closely with the simple scoring of the scale (green line) during the SF-36 era. During the VR-12 era, however, the green line (simple scoring algorithm) deviates substantially from the other lines, while the red line (rescored scoring algorithm) tracks more smoothly between the SF-36 and VR-12 era surveys.

Fig. 2

The green line in a–c represents the original method of calculating the SF-36 and VR-12 scale scores for PF (physical functioning), while the red line represents the rescored method. In a, b, the blue and black lines represent the rescored versions of the PF2 (moderate activities) and PF4 (climbing several flights of stairs) questions, respectively. PF2: rescored “moderate activities” question of the physical function scale. PF4: rescored “climbing several flights of stairs” question of the physical function scale. Mean of PF: mean physical function scale calculated using the rescored versions of each of the ten SF-36 and two VR-12 component questions. a The line for “Mean of PF” is based on 10 items in cohorts 1–8 but on only two items in subsequent cohorts. b The line for “Mean of PF” is based on 10 items in cohorts 1–6 but on only two items in subsequent cohorts. Mean of PF_orig: mean of the physical function scale calculated using the original prorated response values on a scale from 0 to 100 instead of the rescored versions. Mean of diff_PF: mean of the difference in the physical function scale using the rescored versions of each of the ten SF-36 and two VR-12 component questions. Mean of diff_PF_orig: mean of the difference in the physical function scale calculated using the original prorated response values on a scale from 0 to 100 instead of the rescored versions. (Color figure online)

Figure 2c represents the difference in scores between the baseline and follow-up surveys. During the era that the SF-36 was the baseline and follow-up survey (cohorts 1–6), the rescored and simple scoring lines track very closely to one another. For cohorts 7 and 8, where the SF-36 was the baseline and VR-12 the follow-up survey, there is substantial deviation between the two lines, reflecting the differences caused by rescoring of the VR-12. For cohorts 9–12, where the VR-12 was used for both baseline and follow-up, the red and green lines are again closer together.

Discussion

Our study used a novel methodology to create an algorithm for rescoring the scales of the VR-12 to numerically match those of the SF-36. Our cluster approach is aimed at group-level interpretation. This method is applicable across a broad array of potential samples of individuals that differ in health status due to factors such as medical conditions, symptoms, and demographic characteristics. The extensibility values related to the matching of VR-12 items to SF-36 scales were generally credible, although somewhat less optimal for the physical and emotional role items and scales and for the general health item. The scales of the VR-12 and SF-36 contain granular health status information that can be used alongside the physical (PCS) and mental (MCS) summary scores, and information at the scale level may be of additional use in understanding how various factors affect the health status of patients [23].

The new bridging algorithm has important implications for missing data. One can use the embedded item rescoring based on the best-extensibility criteria; the rescored items provide reasonable estimates even if only one item of a 10-item scale is available. If more than one item is available, the results can be averaged across those present. If only a few items are missing, the Modified Regression Estimation (MRE) approach accounts for the missing information by estimating the score based on the items that are available [23, 24]. Historically, two other methodologic approaches have been used to address this problem: regression and item response theory [25]. However, both have significant weaknesses. The regression answer is sensitive to the nature of the estimation sample. Since a particular item, physical measurement, or scale has response error, both the dependent and independent variables of the regression contain error; at best this poses a difficult errors-in-variables problem, and at worst an endogeneity problem if both variables come from the same samples and are not assessed independently. The impact of error in the independent variable is to bias the regression coefficient toward 0, so these answers suffer from regression-to-the-mean bias. The item response theory answer solves the regression-to-the-mean problem, but other problems remain; in particular, what to do with values at the ceiling or the floor? In some cases this is not a severe problem, but in many cases we are trying to replace a whole scale, such as the SF-36 vitality scale, with a single item, such as the energy item. In these cases, the treatment of floor and ceiling values is either very arbitrary or (in some IRT implementations) dependent on the target data. The latter is not satisfactory because it creates a definition that varies from one target dataset to another, so datasets cannot be compared.

Several factors may limit the results of this study. First, while we were able to identify a methodology to minimize the extensibility of VR-12 items and scales, the magnitude of extensibility can limit the ability to detect differences in scale scores when MHOS cohorts that span the SF-36 and VR-12 eras are combined. Extensibility was noticeably higher for the general health item and for the role-physical and role-emotional scales. For the “role” items, the higher extensibility was likely due to the difference in the number of response choices between the SF-36 and VR-12 surveys. Second, in combining health status scores across cohorts, it is important to take sample evolution into account. In this report, we did not attempt to address the adjustments that would be required for variation in respondent attributes, survey-related factors, changes over time in the criteria defining disease diagnoses, and changes in treatment among cohorts and their members. Future studies can evaluate these factors. Third, our algorithm was developed from the MHOS in respondents 65 years of age or older; it is possible that VR-12 surveys in a different setting with younger individuals could produce somewhat different results. Fourth, we did not revise the scoring of the PCS and MCS because we did not want to compete with a previous solution for the PCS and MCS commissioned by the National Committee for Quality Assurance (NCQA) on behalf of CMS.

In summary, we created a new bridging algorithm for rescoring VR-12 items and scales to match corresponding SF-36 scale scores that improved upon the “simple” method of scoring items and scales. In particular, through the use of “cluster” subsamples of the MHOS data and the application of extensibility, the algorithm that was developed should be applicable across a wide range of MHOS samples for future studies.