Introduction

Response shift (RS), a change in an individual’s internal standards, values, or conceptualizations of health and well-being, has been acknowledged as an important phenomenon that influences the assessment of longitudinal changes in health-related quality of life (HRQOL) studies [40, 41]. While RS is increasingly being investigated in longitudinal studies that adopted HRQOL instruments, this field of research has witnessed an increased number of advanced statistical methods such as structural equation modeling [30, 31], latent trajectory model [28]; classification and regression trees [3, 19, 20], and discriminant analysis and logistic regression analysis [23, 42] for RS detection in patient-reported HRQOL data.

Although these statistical procedures rest on different underlying model assumptions, they are all sensitive to the data analytic conditions that are inherent in longitudinal HRQOL data such as non-normality, missing data, and measurement error. Missing values arise because study participants drop out or are absent at intermittent occasions throughout the study [12, 22, 29] and because of questionnaire item non-response. Ignoring missing observations can lead to incorrect inferences about the size of a treatment effect or the magnitude of longitudinal change. Missing data may also result in incorrect conclusions about the presence of RS in longitudinal data.

Although several statistical methods have been developed to handle missing data in HRQOL studies [2, 29], the choice of a method is generally dependent on the pattern of missingness. Little and Rubin [22] summarized these patterns into three missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR arises when the reason for the missingness is not related to any observed covariate or the true unknown values of the missing data. For the MAR mechanism, missing observations are unrelated to the true values of missing data but are associated with one or more observed covariates. Both MCAR and MAR are considered ignorable missing data mechanisms because the distribution of the missing observations is independent of the unobserved data, whereas MNAR is a non-ignorable pattern because missingness is related to unobserved underlying values of the missing data.

Missing data methods include conventional approaches such as complete-case analysis, last-observation-carried-forward, hot-deck imputation, pair-wise deletion, average-available-observations-carried-forward, and mean imputation [9, 18]. Although complete-case analysis has been shown to be valid when the missing observations are MCAR, such analyses may not be valid when the mechanism is MAR or MNAR, resulting in reduced statistical power and biased parameter estimates. However, all of these conventional methods for missing data have been shown to induce bias in the parameter estimates from longitudinal models under MAR or MNAR assumptions. A limitation of complete-case analysis is that it reduces sample size, resulting in a loss of statistical power. The true trajectory of change may be underestimated when the last observation is carried forward to account for missingness in longitudinal data. Moreover, mean imputation may underestimate the variability in the data, resulting in underestimated correlations among questionnaire items or summary scores [9]. Other statistical approaches for dealing with missing observations may be less frequently adopted by researchers because they are computationally intensive. These alternative techniques for parameter estimation include expectation–maximization (EM) imputation, full information maximum likelihood (FIML) imputation, multiple imputation, inverse-probability weighting, pattern mixture, and selection models [1, 9, 17, 18]. While imputation-based methods rest on the assumption that the missing observations are MAR, pattern mixture and selection models assume that the data are MNAR.

Despite the existing range of statistical methods for handling missing data in HRQOL studies, the influence of missing data on the detection of RS in longitudinal HRQOL data is unknown. We hypothesize that the choice of methods for handling missing data can influence the conclusion about the presence of RS in longitudinal HRQOL. More importantly, we argue that, although, imputation methods tend to result in increased sample size, these methods can result in either upward or downward bias in domain correlations and RS effect sizes, which in turn, may affect the statistical power to detect the presence of RS. Therefore, the purpose of this study is to investigate the effect of the choice of methods on conclusions about the presence of reprioritization RS. Data from an existing longitudinal data set of caregivers of stroke patients are used to demonstrate the implementations of these methods. Recommendations about strategies for handling missing data will be developed based on these study results.

Methods

Data Source

Data are from a multicentre observational longitudinal study “Understanding Quality of Life Post-Stroke: A Study of Individuals and Their Caregivers” [26, 27]. This study aimed to understand the effects of stroke severity on HRQOL of stroke survivors and their caregivers over a one-year period post-stroke. Data were collected on 678 stroke patients and their 409 caregivers at 1, 3, 6, and 12 months post-stroke. Study non-participation was low; only 67 individuals who were approached refused to participate (16.3 %).

Data on stroke patients’ and their caregivers’ HRQOL were collected using both generic and stroke-specific measures, including the Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36), Health Utility Index [11, 25], and Stroke Impact Scale [8] and other stroke-specific HRQOL measures such as the Stroke Impact Scale [7, 46]. The SF-36, which is the focus of this study, encompasses eight domains of HRQOL, including role physical, bodily pain, physical functioning, general health, role emotional, mental health, vitality, and social functioning. The domain scores are scaled to range in value from zero (poor health) to 100 (good health).

Severity of stroke was measured in survivors using the 16-item Stroke Impact Scale (SIS-16) [7]. The items encompass four physical functioning domains Scores range in value from zero (poor functioning) to 100 (good function). A score below 75 reflects severe disability in physical functioning, while scores above 75 indicate less-severe disability; these two stroke severity groups were used in the RS analysis. This instrument has been shown to possess good psychometric properties [7]. Data on caregivers’ characteristics such as age, sex, previous experience as a caregiver, number of children, and relationship with patients were also collected at baseline.

Missing data methods

Three commonly adopted missing data methods for longitudinal data were selected for comparison. These were mean imputation, EM imputation, and multiple imputation [15, 32, 38].

Mean imputation

The mean imputation method for longitudinal data replaces missing values on each variable and measurement occasion by the mean of the total sample for each domain at each measurement occasion [15, 32, 38, 43]. While this imputation method is easy to implement, it can underestimate standard errors of parameter estimates, which may lead to erroneous conclusions on tests of significance. Mean imputation can be applied to missing item scores or domain scores. One advantage of item-level mean imputation over domain-level mean imputation is that it maximizes the number of observations. However, there have been few comparisons of the effect that item- and domain-level mean imputation has on study conclusions [13, 20].

EM likelihood-based imputation

Likelihood-based methods, such as EM and FIML, have been developed for handling missing data in cross-sectional data but also extended to longitudinal data [6, 10, 37]. The missing values are predicted based on maximum likelihood estimates from the available data. The EM algorithm is an iterative process that consists of two steps: the expectation (E) step and the maximization (M) step. In the E step, the expectation of the complete data log-likelihood is derived, given the observed data and the estimated parameters from a previous iteration. In the M step, the conditional expectation of the complete data log-likelihood is maximized. The observed data log-likelihood is increased until the algorithm converges. The EM imputation method is advantageous in that it allows the inclusion of the maximum number of observations. The EM method provides unbiased and efficient parameter estimates when data are MCAR or MAR but not when the data are MNAR.

Multiple imputation

For the multiple imputation method, several datasets are created based on random draws from a distribution of plausible missing values, the analytic technique is applied to each dataset, and the relevant parameter or estimate is summarized [14, 39]. This method overcomes the underestimation of standard errors of parameter estimates that arises with single imputation. Three widely used multiple imputation methods are the predictive model method, propensity score method, and Monte Carlo Markov Chain (MCMC) method. While the first two methods are appropriate for monotone missingness, the MCMC method can be applied to non-monotone missingness. A dataset is said to have a monotonic missing pattern when Y ijk , the ith (i = 1, … n) subject’s measurement on the kth (k = 1, …, p) domain variable in the jth (j = 1, … J) group, is missing for an individual implies that all subsequent measurements are missing for the individual. Standard statistical procedures are applied to each of the imputed datasets to estimate the parameters, and the results are combined, as recommended by Rubin [34]. Let Q m be an estimate of the parameter of interest obtained from the mth (m = 1, …, M) imputed dataset and U m is the standard error associated with Q m . The overall parameter estimate is calculated as the average of the parameter estimates from the imputed datasets,

$$ \overline{Q} = \frac{1}{M}\sum\limits_{m = 1}^{M} {Q_{m} } , $$
(1)

while its standard error is calculated as

$$ T = \sqrt {\overline{U} + (1 + \frac{1}{M})B} , $$
(2)

where \( \overline{U} = \frac{1}{M}\sum\nolimits_{m = 1}^{M} {U_{m} } \) is the average of the within-imputation variance, and \( B = \frac{1}{M}\sum\nolimits_{m = 1}^{M} {(Q_{m} - \overline{Q} } )^{2} \) is the between-imputation variance. Previous research has shown that the efficiency of parameter estimates based on multiple imputations increased as the number of imputation increased [34] Although three to five imputations have been found to be sufficient in many applications especially when the proportion of missing data is <30 %, as many as 100 imputations have been recently recommended when the proportion of missing data is more than 30 % [14]. Standard multiple imputation strategies are readily implemented in standard statistical software packages including SAS [16] (SAS Inc 2011). Moreover, even though the multiple imputation model for continuous outcomes rests on the assumption of a multivariate normal distribution, previous research has shown that multiple imputation based on the MCMC method can be robust to non-normality when compared to other methods [24, 37]. As a result, the multiple imputation method for continuous outcomes can also be used to impute missing data on HRQOL items that are binary or ordinal. A rounding strategy is recommended so that the imputed responses can fit the range of values for the binary or ordinal variables [16, 37].

Relative importance measures for reprioritization RS

Statistical tests based on measures of relative importance derived from descriptive discriminant analysis and logistic regression models have been recently developed for detecting RS in longitudinal HRQOL data [23]. These test procedures are based on the premise that there is a statistically significant longitudinal change in the relative importance of the HRQOL domains when reprioritization RS is present in the data (Figs. 1, 2).

Fig. 1
figure 1

Changes in Standardized Discriminant Function Coefficients for SF-36 Domains between Baseline and 6 Months by Missing Data Methods BP SF-36 Bodily pain domain, PF SF-36 Physical functioning domain, RP SF-36 Role physical domain, RE SF-36 Role emotional domain, SF SF-36 Social functioning domain, MH SF-36 Mental health domain, VT SF-36 Vitality domain, and GH SF-36 General health domain

Fig. 2
figure 2

Changes in Standardized Logistic Regression Coefficients for SF-36 Domains between Baseline and 6 Months by Missing Data Methods. BP SF-36 Bodily pain domain, PF SF-36 Physical functioning domain, RP SF-36 Role physical domain, RE SF-36 Role emotional domain, SF SF-36 Social functioning domain, MH SF-36 Mental health domain, VT SF-36 Vitality domain, and GH SF-36 General health domain

Measures of relative importance derived from discriminant analysis or logistic regression models use weights (i.e., variable coefficients) or ranks (i.e., rank order of the coefficients based on magnitude). This approach has been proposed to quantify the relative importance of variables that discriminate between two or more groups defined a priori. These include standardized discriminant function coefficients (SDFC), discriminant ratio coefficients, standardized logistic regression coefficients (SLRC), and Pratt’s index for logistic regression [35, 44, 45]. The first two measures are developed based on discriminant analysis, and the latter two are derived from logistic regression models. These measures have been described in detail elsewhere [35]. In the logistic regression model, the group-membership variable (e.g., stroke severity) is the dependent variable, while independent variables are the domain scores. In contrast, in discriminant analysis, the domains are the dependent variables model, while group-membership variable is the independent variable. To test for reprioritization RS, discriminant analysis or logistic regression models are conducted independently at each measurement occasion. To generate the sampling distribution of the differences in relative importance weights, bootstrap samples in each group were centered by subtracting the group means from the original data. The relative importance weights were then calculated from the centered data at each measurement occasion, and the bootstrap difference in relative importance weights was estimated. For each of the relative importance measures, reprioritization RS was considered present in a domain if the change in signed values of the relative importance weight or for a domain is statistically significant at a pre-specified level of significance (i.e., α = 5 %) [23].

Statistical analysis

We investigated the impact of missing data methods on test procedures based on changes in the relative importance of HRQOL domains in detecting reprioritization RS in longitudinal data. Data from the self-reported HRQOL of stroke caregivers at baseline and month 6 post-stroke are used to demonstrate the implementation of these methods. This analysis focuses on detecting reprioritization RS among the SF-36 domains over the 6-month period post-stroke for caregivers of patients with severe and less-severe stroke. Preliminary descriptive analyses of the demographic characteristics and self-reported SF-36 domain scores for the stroke caregivers were conducted at baseline. To determine the pattern of missing data, patients with complete data at both occasions and those with incomplete data were compared on demographic characteristics and SF-36 domain scores at study baseline using t tests or χ 2 tests. The missing data methods were compared using test procedures based on changes in relative importance weights to detect reprioritization RS among the SF-36 domains.

For the multiple imputation model, each SF-36 domain at baseline and month 6 were included in the imputation model along with age of caregivers, number of children, gender, and relationship with spouse as ancillary variables. Given the apparent arbitrary missing data pattern observed for this dataset, a MCMC multiple imputation model was adopted and developed independently for each stroke severity group [24]. Given that about 40 % of the study participants had missing data on at least one measurement occasion, 100 imputed datasets were created from the multiple imputation model, in keeping with recommendations from previous research [14].

Given that no single relative importance measure has been uniformly recommended, the relative importance weights are derived from more than one relative importance measure, namely: SDFC derived from discriminant analysis and SLRC derived from the logistic regression model. For each measure, discriminant analysis and logistic regression analysis were conducted on data at baseline and 6-month post-baseline. Statistical significance of the changes in the relative importance weights was determined by the empirical bootstrap method, in which the distribution of the test statistic is approximated using B = 500 bootstrap samples, in keeping with previous research recommendations [4, 5]. Reprioritization RS was considered present in a domain if the change in the coefficients derived from each relative importance measure was statistically significant (α = 0.05).

Given that our example data contain missing observation, the assessment of statistical significance may vary depending on the missing data method adopted. For complete-case analysis, relative importance test procedures for RS detection were conducted on the remaining data, after case-wise deletion of observations with missing data. For mean imputation and EM imputation methods, test procedures of relative importance were conducted on the imputed datasets. For the multiple imputation method, the differences in relative importance weights over the two occasions for the original data were estimated by using Rubin’s rule [34] to combine the imputation-specific estimates of differences in relative importance weights. The distribution of the differences in relative importance weights was obtained by repeatedly sampling from the original incomplete data with replacement (i.e., bootstrap) to create B bootstrap datasets. For each bootstrap sample, the differences in relative importance weights were estimated based on multiple imputation of the bootstrap sample. A detailed description of the implementation of relative importance test procedures for multiple imputed dataset is described in “Appendix 1”. All analyses were conducted using SAS version 9.3 [36].

Results

Table 1 describes the demographic characteristics and self-reported SF-36 domain scores for caregivers of survivors with severe and less-severe stroke conditions at study baseline. Of the 409 caregivers, about 344 patients had complete data on the SF-36 domains at baseline, while 281 patients had complete data on all domains at month 6. Spousal caregivers comprised 64.7 % of the sample; 39.9 % of the caregivers were caring for patients with severe stroke conditions. The caregivers were predominantly female.

Table 1 Characteristics of stroke caregivers at study baseline

Of the 409 caregivers recruited into this study, 209 (50.0 %) has complete data on all the SF-36 domains and at all four measurement occasions. At baseline, 344 (84.1 %) of the participants had complete data, while 68.7 % of the participants had complete data at month 6. Further analyses shows that the pattern of missingness was significantly associated with severity of stroke for the patients receiving care; caregivers of patients with less-severe stroke conditions were more likely to have complete data on all four more occasions than caregivers of patients with severe stroke conditions. However, there was no significant association between the likelihood of missing data and the stroke severity after adjusting for caregiver’s relationship with the stroke patients. We therefore assume that the missing data mechanism was MAR.

The results of the comparison of relative importance weight analyses as a function of the missing data method used are presented in Table 2. These analysis results are based on changes in relative importance of the domains between baseline and 6-month follow-up. The magnitudes of changes in SDFCs were largest for role physical domain for complete-case analysis. The vitality domain had the largest changes in SDFC for mean imputed, EM imputed, and multiple imputed datasets.

Table 2 Changes in relative importance weights for SF-36 domains by missing data method

When the complete-case analysis method was adopted, there was no evidence of statistical significant change in the SDFCs over the two measurement occasions for any of the SF-36 domains. But there were statistically significant changes in SDFC on physical functioning and vitality domains when mean imputation method was adopted. For EM imputation methods, the magnitude of change in SDFCs over time was statistically significant for the vitality domain only. There were statistically significant changes in SDFCs for mental health and vitality domains, when multiple imputation method was adopted.

Furthermore, unlike test procedures based on SDFC, the range of changes in the SLRCs among the domains and the corresponding standard errors were smaller. There was no statistically significant change in the SLRCs for any of the SF-36 domains when complete-case analysis method was adopted. But there were statistically significant changes in the coefficients for the physical functioning and vitality domains when mean imputation or EM imputation method was adopted. Statistically significant changes in SLRCs were also noted on physical functioning, mental health, and vitality domains when multiple imputation method was adopted.

Simulation study

A Monte Carlo simulation study was conducted to examine the statistical power of test procedures based on SDFC and SLRC under complete-case analysis, as well as with mean imputation, EM imputation, and multiple imputation methods. All procedures were investigated for two independent groups. The simulation conditions investigated included: (a) total sample size, (b) number of study variables, (c) missing data mechanisms, and (d) proportion of missing data. Total sample sizes of N = 100 and 250 were investigated, representing small and medium population sizes. For each group, unequal group sizes were considered using a 3:2 ratio. The number of outcome variables was fixed at p = 4. Given that the investigated procedures are likely to be influenced by the missing data mechanism, we compared the relative importance test procedures under MCAR, MAR, and MNAR missingness mechanisms. The percentages of missing observations were set at 15 and 25 %.

The multivariate longitudinal data over two measurement occasions for each group. First, the multivariate data at baseline were generated from a multivariate normal distribution with known means and group covariances using the mvtnorm package in R [33]. The data at the second measurement occasion was generated as a linear combination of baseline data, change from baseline, and multivariate latent factor representing reprioritization RS effects on the domains. The multivariate data at the second occasion were therefore obtained as.

$$ {\mathbf{Y}}_{{ij\text{2}}} = {\mathbf{Y}}_{ij1} + {\varvec{\Delta}}_{ij} + {\mathbf{I}}_{ij} $$

where Y ij1 is the p × 1 vector of observations for the ith (i = 1, …, n j ) subject in the jth (j = 1, 2) group at baseline, Δ ij is the p × 1 change from baseline for the ith subject, and I ij is the p × 1 latent factor vector representing the RS effect for the ith subject. Assume \( {\mathbf{Y}}_{ij1} \sim N_{p} ({\varvec{\upmu}}_{j1} ,{\varvec{\Sigma}}_{1} ) \), \( {\varvec{\Delta}}_{ij} \sim N_{p} ({\varvec{\upeta}}_{j} ,{\varvec{\Phi}}_{j} ) \), \( {\mathbf{I}}_{ij} \sim N_{p} ({\varvec{\uptheta}}_{j} ,{\varvec{\Psi}}_{j} ) \), where \( {\varvec{\upeta}}_{j} , \) \( {\varvec{\uptheta}}_{j} \), \( {\varvec{\Psi}}_{j} \), and \( {\varvec{\Phi}}_{j} \) are the p × 1 vector of mean change from baseline, p × 1 vector of mean latent factors, p × p covariance matrix for the change from baseline, and p × p covariance matrix for the latent factors, respectively, for the jth group. The data for the second measurement occasion, Y ij2, were generated from a multivariate normal distribution with jth group mean as.

$$ {\varvec{\upmu}}_{j2} = {\varvec{\upmu}}_{j1} + {\varvec{\upeta}}_{j} + {\varvec{\uptheta}}_{j} $$

and the jth group covariance is

$$ {\varvec{\Sigma}}_{j2} = {\varvec{\Sigma}}_{j1} + {\varvec{\Phi}}_{j} + {\varvec{\Psi}}_{j} $$

The mean configurations for the multivariate data at baseline, change from baseline, and the latent RS effects are described in Table 3. For multivariate data at baseline, a quadratic mean configuration across the domains in Group 1 while a constant mean configuration, represented by a unit vector, was assumed as the mean vector for Group 2. On the other hand, a linearly decreasing mean configuration was assumed across the domains for θ j in Group 1, while a null vector was assumed for θ j in Group 2.

Table 3 Description of mean patterns for the simulation study

Moreover, relative importance measures are known to be influenced by both the magnitude and pattern of correlation among the observations [45]. A compound symmetric (CS) structure was assumed for the correlation among the domains at baseline. More specifically, the following CS correlation matrices were assumed for multivariate data at baseline and multivariate change from baseline: (a) Q 1: CS with ρ = 0.3 and (b) Q 2: CS with ρ = 0.7. On the other hand, a CS correlation structure was adopted for the latent factor variables such that the magnitude of the correlation among the latent factors was assumed ρ = 0.2.

Conditions for statistical power are investigated by simulating known RS effects, as measured by the multivariate latent factor, in the multivariate data at the second measurement occasion. RS effects were assumed to be present on all the domains in Group 1, but not in Group 2. The statistical power for each procedure was measured using all-variable power (ALLV), the power of a test procedure to reject the p non-null hypothesis in the family of p hypotheses. A total of 500 simulations were conducted for each of the combination of simulation conditions. For each procedure, B = 500 bootstrap samples were used to generate the distribution of differences between relative importance weights. For α = 0.05, a Bonferroni correction was used to adjust for multiple correlated tests of significance. Differences in statistical power <5 % were considered negligible, while differences greater than 10 % were considered substantial.

Simulation results

Table 4 describes the average ALLV power for relative importance test procedures based on SLRC and SDFC when complete-case analysis was conducted, as well as with mean imputation, EM imputation, and multiple imputation methods. These results were stratified by total sample size, percentage of missing data, and missingness mechanism. In general, the average ALLV power for test procedures based on the SDFC was at least 9.0 % higher than the average ALLV power for test procedures based on SLRCs, regardless of the percentage of missing data, missingness mechanism, and method for adjusting for missing data. On the other hand, the average ALLV power for each procedure decreased as the percentage of missing data increased, regardless of the sample size, mechanism of missing data, and imputation method. For example, when N = 250, the average ALLV power for tests based on SLRC and SDFC procedures were 19.3 and 30.8 %, respectively, for complete-case analysis method when 15 % of the data were missing via the MCAR mechanism. But when the percentage of missing data was 25 %, the average ALLV power for the former and latter procedures were 15.5 and 25.5 %, respectively.

Table 4 Average all-variable power (%) for relative importance tests by missing data method when p = 4 (simulation results)

There were negligible differences in average ALLV power of the test procedures under both MCAR and MAR missingness mechanisms. However, both test procedures were not equally robust to MNAR mechanism. More specifically, test procedure based on SDFC was not influenced by the pattern of missing data, but test procedure based on SLRC was substantially less powerful when the data were missing via the MNAR mechanism. For example, when 25 % of the data were missing, the average ALLV power for test procedure based on SDFC for MCAR, MAR, and MNAR were 82.9, 82.3, and 78.9 %, respectively, when complete-case analysis method was adopted. In contrast, the average ALLV for test procedure based on SLRC were 70.9, 71.4, and 56.8 %, when the data are MCAR, MAR, MNAR, respectively.

Moreover, both test procedures were sensitive to the missing data method. More specifically, the ALLV power for each procedure was smallest when the complete-case method was adopted and largest when an imputation method was adopted. On the other hand, there were no substantial differences in the average ALLV power for test procedures based on mean, EM, and multiple imputation methods. For example, for N = 250, when 25 % of the data were missing via the MCAR mechanism, the average ALLV power for test procedure based on SLRC under complete-case analysis, mean imputation, and multiple imputation methods were 70.2, 83.5, and 82.7 %, respectively.

Finally, the average ALLV power for each procedure when N = 250 was at least two times the average ALLV power when N = 100, regardless of the mean configuration, percentage of missing data, missing data mechanism, and missing data method. For example, when 15 % of the data were missing via the MCAR mechanism, the average ALLV power for test procedures based on SDFC and SLRC procedures under mean imputed data were 44.9 and 20.3 %,respectively, when N = 100. When N = 250, the average ALLV power for the test procedures under mean imputed data were 92.5 and 84.7 %, respectively.

Discussion

This study investigated the presence of reprioritization RS in a stroke caregiver population characterized by missing data. The choice of imputation methods used to address missing data will influence conclusions about the presence of RS. Neither of the two relative importance tests for reprioritization RS showed any evidence of RS when complete-case analysis was adopted. But the test procedures based on SLRC and SDFC revealed the presence of RS on physical function and vitality domains when mean imputation or EM imputation was adopted. The multiple imputation method showed evidence of RS on physical functioning, mental health, and vitality domains.

The findings about RS detection in the numeric example are consistent with results of our simulation study, which revealed that relative importance tests procedure based on complete-case analysis are less powerful in detecting RS than tests based on imputation methods. The statistical power of these test procedures decreased as the proportion of missing data increased, but the power of the test procedures increased as the sample size increased. On the other hand, there were negligible differences in statistical power of the procedures when mean, EM, and multiple imputation methods were adopted. Furthermore, our simulation results revealed that a test procedure based on changes in SLRCs was substantially less powerful when the data were MNAR compared to the power results when the data are MCAR. There were negligible differences in the average ALLV power of the test procedure based on SDFCs when the data were missing via the MCAR or MNAR mechanisms, regardless of the choice of missing data method, percentage of missing data, and sample size.

Although results from the simulation study revealed that relative importance tests based on imputation methods are more powerful to detect RS than tests based on complete-case analysis, the analysis results of the example datasets showed variations in estimates of differences in relative importance weights over the 6-month period for various imputation methods for each relative importance test procedure. The observed variations can be attributed to several factors such as imputation methodology, domain correlations, with-in domain autocorrelations, and mechanism of missing data. For example, the choice of an imputation method can affect the estimates of the difference in relative importance weights on the domains. Given that, mean imputation tends to underestimate the correlations among the SF-36 domains, estimates of relative importance weights at each measurement occasions, and estimates of differences in relative importance weights over the two measurement occasions may be biased. In contrast, both EM and multiple imputation models preserve the correlation structure by estimating the missing data through maximum likelihood and MCMC methods, respectively. We recommend that the changes in relative importance weights as well as the corresponding standard errors for the imputation methods be considered when determining whether the choice of imputation methods might influence the interpretation of the original clinical study.

The study has some caveats. Our investigations of the mechanism of missing data in the example data suggest that the missing observations for participants over the two measurement occasions are likely to be MAR, because the pattern of missingness over the 6-month follow-up was associated with self-reported domain scores at baseline. Nonetheless, it is not known whether the missing data mechanism might also be associated with unobserved variables (i.e., MNAR), and this assumption cannot be tested. On the other hand, the simulation study described here only focused on the statistical power of the test procedures when the correlations among the domains are assumed to be equal. This assumption may not be realistic in longitudinal HRQOL studies characterized by heterogeneous (i.e., unstructured) between-domain correlation matrix.

In summary, this study investigates test procedures based on relative importance measures for RS detection in incomplete longitudinal HRQOL data in the context of complete-case analysis, and three standard missing data imputation methods. Our findings revealed that the detection of reprioritization RS among the HRQOL domains is affected by the choice of both the proportion and type of missing data mechanism and the imputation methods. More specifically, relative importance test procedures for RS based on changes in SDFCs and SLRCs are most powerful when mean or multiple imputation methods are used to fill missing observations. We therefore recommend that one of these methods should be adopted when testing for RS in incomplete longitudinal HRQOL data.