Background

Fatigue is a subjective feeling of overwhelming and exhausting tiredness. In patients suffering from chronic diseases, e.g., cardiovascular diseases, neurological disorders, arthritis, chronic obstructive pulmonary disease, or cancer, fatigue can persist for months, debilitate the patient’s functioning, and interfere with self-care and quality of life [1,2,3,4]. In patients with cancer, it is the most frequent complaint [5]. Prevalence rates during treatment range from 25 to 99%, and fatigue has been reported to occur even years after treatment [6].

Fatigue can be described as “a multidimensional phenomenon, with physical, emotional, and cognitive manifestations” [7], but there is no universally accepted definition [8], mostly because the pathophysiological mechanisms underlying the development of fatigue are unclear [9]. It is therefore not clear whether fatigue has to be differentiated into separate dimensions (e.g., physical, emotional, and/or cognitive) with separate causes, as in a multiple-symptom concept, or whether fatigue has to be understood as the one common cause of different kinds of tiredness, as in a multidimensional concept. In the first case, it is assumed that different forms of tiredness arise from different sources; consequently, the separate dimensions are expected to behave differently and have to be differentiated to enable different treatments. In the second case (multidimensional concept), it is assumed that different kinds of tiredness arise from only one common source; thus, the dimensions should not behave differently and do not have to be differentiated. Although the multidimensional concept is based on a consensus of experts [10, 11], there are studies indicating that fatigue should be considered a multiple-symptom concept [11, 12].

Against this background, it is not surprising that a recent systematic review comprising 40 instruments to measure fatigue stated that the definition of dimension and domain seems to be a matter of the authors’ opinion [13]. Hence, the comparability of data across studies is hampered in two ways: by the question of the underlying concept and by the large number of instruments. In this study, we assume the multidimensional concept and address the question of converting scores into one another using a selection of six different instruments.

Recently, construct-based, instrument-dependent common metrics for specific outcomes have been developed in order to facilitate comparability of data collected with different instruments, for example, for anxiety [14], depression [15, 16], physical function [17,18,19], and also for fatigue [20,21,22].

Methods to develop common metrics are described by the PROsetta Stone Project funded by the National Institutes of Health (NIH)/National Cancer Institute (NCI) [23]. Details can be found at http://www.prosettastone.org/AnalysisReport. The general approach is to model item responses to different instruments as probabilistic functions of a common, single latent trait using unidimensional IRT models. Linking procedures using multidimensional IRT models have been developed but are seldom applied [24]. A more detailed overview of different IRT linking procedures can be found in [25]. Initial validation studies in independent samples show that differences between latent trait estimates derived from different instruments are small at the sample level [26, 27].

Hypothesizing that there is one common cause for different dimensions of tiredness (multidimensional concept), Lai et al. showed that fatigue can be considered sufficiently unidimensional [28], based on a 72-item fatigue bank. Using the same statistical technique of bifactor modeling, Cella et al. [29] found support for unidimensionality of the 13 items of the Functional Assessment of Chronic Illness Therapy—Fatigue (FACIT-F). These studies provide evidence that fatigue instruments can be calibrated on a unidimensional common metric to facilitate the comparison of results from studies using diverse instruments even if they assess different dimensions of fatigue.

The main aim of this study was to estimate a common metric to convert patients’ scores on six different fatigue instruments into one another. In many cases, it is also useful to compare results with the general population (GP), a meaningful reference for judging fatigue severity both for individuals and for groups of patients. Hence, we additionally anchor the common metric to the GP.

Methods

Sample and setting

The study participants were recruited in an oncological rehabilitation clinic in Germany. The rehabilitation programs, generally lasting 3 weeks, help restore patients’ physical and psychosocial functioning. Patients receive physical fitness exercises, physiotherapy, relaxation techniques, and counseling concerning occupational and nutritional behavior. Inclusion criteria for this study were age 18 years and above, sufficient command of the German language, and absence of severe cognitive impairment. Informed consent was obtained from the study participants after they were given a full explanation of the purpose and nature of the data collection and storage. The study was approved by the Ethics Committee of Leipzig University. A total of 1547 consecutive patients were asked to participate, and 1225 (79.2%) of them agreed to take part in the study.

Instruments

The following socio-demographic and medical characteristics were assessed: gender, age (years), education, occupation, partnership, diagnosis (ICD-10), and time since diagnosis (months). All data concerning socio-demographic and medical characteristics, as well as the data of all instruments mentioned below, were based on self-report. In our analysis, we included the items from six measures of fatigue. The criteria for the selection of these measures were the frequency of use in oncological research and treatment, novelty (especially the FA12 and the DIG11), and the availability of normative data sets for linking to the general population (FA3, FA12, MFI4, FS11). The abbreviations of the questionnaires given in parentheses are explained below.

The Quality-of-Life Questionnaire Core 30, developed by the European Organization for Research and Treatment of Cancer (EORTC QLQ-C30 [30]), contains a three-item symptom scale for fatigue that we call FA3. The response options range from 1 to 4, with higher values indicating higher fatigue levels. The sum of the items ranges from 3 to 12; it is then transformed to a range from 0 to 100.

The EORTC QLQ-FA12 is an additional module of the EORTC QLQ-C30 that measures specific forms of fatigue (physical, emotional, and cognitive fatigue) and its interference with daily activities and social life [31]. Kecke et al. [32] and Hinz et al. [33] suggested a summary score using all twelve items. The sum ranges from 12 to 48, and is transformed to a range from 0 to 100. To indicate this scale, we use the abbreviation FA12.

From the Multidimensional Fatigue Inventory (MFI-20), we used its General Fatigue scale with four items (henceforth called MFI4). Response options range from 1 to 5 [34], and the sum ranges from 4 to 20.

The Brief Fatigue Inventory (BFI9) contains nine items that assess the current severity of fatigue and fatigue-related impairment during the past 24 h [35]. Response options range from 0 to 10, with higher values indicating higher severity or interference. The sum ranges from 0 to 90; it is transformed to a range from 0 to 10.

The Fatigue Scale (FS11) measures physical and mental fatigue with eleven items [36]. The questionnaire is also known as the Fatigue Questionnaire [37] or the Chalder Fatigue Scale. Initially, it consisted of 14 items [38] and was subsequently revised to a version with 11 items [39]. Response options range from 0 to 3, with higher values corresponding to higher fatigue levels; the sum ranges from 0 to 33.

The Diagnostic Interview Guide (DIG11) contains eleven binary items (0 = no, 1 = yes) corresponding to the eleven symptoms named in the proposed ICD-10 criteria for fatigue [7]. To our knowledge, the questionnaire has not yet been validated as an instrument to measure the severity of fatigue via a sum score. We calculated the sum of the items, ranging from 0 to 11, with higher values indicating higher symptom burden.

Cronbach’s alpha coefficients of these measures in our sample were: alpha(FA3) = 0.89, alpha(FA12) = 0.94, alpha(MFI4) = 0.81, alpha(BFI9) = 0.93, alpha(FS11) = 0.92, and alpha(DIG11) = 0.86.

General population

In addition to the sample of study participants, we used fatigue assessments from four different GP surveys to anchor the common metric. Each survey included at least one of the fatigue questionnaires used in this study. The first survey included the quality-of-life questionnaire EORTC QLQ-C30 with the 3-item fatigue subscale FA3; the sample consisted of two subsamples with 1993 [40] and 2448 [41] participants, respectively. The second survey included the MFI-20 (n = 1993) [42]; in our analysis, we only used its General Fatigue scale. The third survey included the FA12 (n = 2424) [33]. The fourth survey included the Fatigue Scale FS11 (n = 2464) [36].

Statistical analyses

Missing values can lead to biased parameter estimates, inflated standard errors, loss of information, and weakened generalizability of the results [43]. Therefore, we estimated missing values at the item level with the Expectation Maximization algorithm [44] in IBM SPSS Statistics 23.

Common metric of the six fatigue scales

In order to link the six questionnaires to one another, we used a single-group design; that is, all respondents came from the same population and answered all questionnaires at the same data collection [25]. To place the parameter estimates on a common scale, we estimated parameters for all items at once (concurrent calibration) [45]. Test scores were equated using expected a posteriori (EAP) estimates of the latent trait for each observable sum score. For each sum score of the respective questionnaire, latent trait estimates were averaged over all possible response patterns to place the sum score on the latent trait continuum [46]. Using one common latent trait for calibration requires estimates from a model with acceptable fit that additionally meets the assumption of local independence, that is, item errors should vary independently.
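The sum-score equating step can be sketched as follows for a toy two-item graded response model. All parameter values, the quadrature grid, and the weighting of response patterns by their marginal probabilities are illustrative assumptions, not the study's actual estimates.

```python
import numpy as np
from itertools import product

# Quadrature grid and standard normal prior for the latent trait (theta)
theta = np.linspace(-4, 4, 81)
prior = np.exp(-0.5 * theta**2)
prior /= prior.sum()

def grm_probs(theta, a, b):
    """Category probabilities of a graded response model item.
    a: discrimination, b: ordered thresholds (len = categories - 1)."""
    # Cumulative probabilities P(X >= k), bracketed by 1 and 0
    cum = [np.ones_like(theta)]
    for bk in b:
        cum.append(1.0 / (1.0 + np.exp(-a * (theta - bk))))
    cum.append(np.zeros_like(theta))
    cum = np.array(cum)
    return cum[:-1] - cum[1:]          # shape: (categories, grid points)

# Illustrative (made-up) parameters for two 4-category items
items = [(1.5, [-1.0, 0.0, 1.0]), (2.0, [-0.5, 0.5, 1.5])]
probs = [grm_probs(theta, a, b) for a, b in items]

# EAP estimate and marginal probability for every response pattern,
# then average the EAPs of all patterns sharing the same sum score
score_eap, score_w = {}, {}
for pattern in product(range(4), repeat=2):
    like = prior * np.prod([p[k] for p, k in zip(probs, pattern)], axis=0)
    weight = like.sum()                 # marginal probability of the pattern
    eap = (theta * like).sum() / weight
    s = sum(pattern)
    score_eap[s] = score_eap.get(s, 0.0) + weight * eap
    score_w[s] = score_w.get(s, 0.0) + weight

for s in sorted(score_eap):
    print(s, round(score_eap[s] / score_w[s], 2))
```

The printed table maps each possible sum score (0 to 6) to a theta value on the common latent trait, which is the form of the conversion tables produced in this study.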

We used confirmatory factor analysis with diagonally weighted least squares (DWLS) estimation and a mean- and variance-adjusted test statistic to determine the underlying structure of the item pool. Model fit was assessed using a combinational rule of the comparative fit index (CFI) and the standardized root mean square residual (SRMR) [47]. Models were rejected only if both CFI and SRMR indicated poor fit (CFI < 0.95 and SRMR > 0.06). Since these cutoffs are rules of thumb and were suggested for maximum likelihood (ML) estimation rather than DWLS, we additionally present the Tucker–Lewis Index (TLI) along with the root mean square error of approximation (RMSEA) and its 90% confidence interval (90% CI).
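As a minimal illustration, the combinational rule can be expressed as a small decision function (the function name is ours; the cutoff values are those stated above):

```python
def reject_model(cfi, srmr, cfi_cut=0.95, srmr_cut=0.06):
    """Combinational rule: reject only if BOTH indices indicate poor fit."""
    return cfi < cfi_cut and srmr > srmr_cut

# A model with CFI = 0.85 and SRMR = 0.10 is rejected; one with
# CFI = 0.94 and SRMR = 0.06 is retained (SRMR is not above the cutoff).
print(reject_model(0.85, 0.10), reject_model(0.94, 0.06))  # -> True False
```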

Initially, we fitted a unidimensional model with all 50 items loading onto one single (general) latent factor. This model expresses the assumption of the multidimensional concept. The instruments, however, differ in their underlying dimensional concepts of fatigue, and in the light of recent studies [11, 12] it seems possible that there is not only one common cause of fatigue. We therefore additionally fitted a bifactor model that represents a limited version of the multidimensional concept insofar as it allows the common cause to be accompanied by additional differing causes. This is implemented by modeling three additional specific factors, uncorrelated with the general factor, to separate variance that cannot be attributed to the general factor: two factors to account for residual variance introduced by similar item content (content regarding emotional fatigue in FA12 and DIG11, and content regarding cognitive fatigue in FA12, FS11, and DIG11), and one factor for residual variance introduced by the specific time frame of the BFI9 items (last 24 h) and their especially wide response range (from 0 to 10). The use of bifactor models is not uncommon for fatigue instruments that measure the content of different fatigue dimensions [29, 48].

Although multidimensional linking methods have been proposed, especially for bifactor models [49, 50], multidimensional IRT models are seldom applied in practice [25]. We likewise did not apply a multidimensional model for calibration; instead, we focused on the estimated parameters of the general factor. If the multidimensional solution fit substantially better, our analytic plan was to fix the parameters of the specific factors at zero and keep the parameters of the general factor fixed at their estimated values for calibration. In other words, we used a unidimensional IRT model for item calibration even though the estimates of the general factor came from a multidimensional model. Therefore, we additionally examined the explained common variance (ECV) for the general factor to judge the degree of unidimensionality in the data [51], and we computed coefficient omega hierarchical (COH) to estimate the degree to which the total score reflects a common dimension, because COH is the appropriate model-based measure of reliability for data with a bifactor structure [52]. Values of COH > 0.80 indicate that the items are sufficiently unidimensional [53]. To test the assumption of local independence, we examined the residual correlations between each pair of items; a value of 0.2 above the average correlation indicates local dependence [54]. These analyses were performed with R Statistics using the R package “lavaan” [55].
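The ECV and COH computations can be sketched from standardized bifactor loadings as follows. The six loadings and the two-specific-factor layout are made up for illustration and do not correspond to the fitted 50-item model.

```python
import numpy as np

# Illustrative standardized bifactor loadings for six items:
# all items load on the general factor; items 0-2 load on specific
# factor 1 and items 3-5 on specific factor 2 (values are invented).
gen = np.array([0.70, 0.80, 0.60, 0.75, 0.65, 0.70])
spec = np.array([0.30, 0.20, 0.40, 0.25, 0.35, 0.30])
group = np.array([0, 0, 0, 1, 1, 1])   # specific factor of each item

# ECV: share of the common variance attributable to the general factor
ecv = (gen**2).sum() / ((gen**2).sum() + (spec**2).sum())

# COH: proportion of total-score variance attributable to the general
# factor alone, assuming standardized items (residual = 1 - communality)
resid = 1.0 - gen**2 - spec**2
spec_var = sum(spec[group == g].sum()**2 for g in np.unique(group))
total_var = gen.sum()**2 + spec_var + resid.sum()
omega_h = gen.sum()**2 / total_var

print(f"ECV = {ecv:.2f}, COH = {omega_h:.2f}")
```

With these invented loadings, both indices come out high, mirroring the kind of result (ECV = 0.81, COH = 0.96) reported for the actual model below.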

To establish the common metric, we chose the better fitting model (unidimensional or bifactor) and estimated it within the Item Response Theory (IRT) framework [56] to predict the underlying latent trait from the specific item responses. Binary items were modeled with a two-parameter logistic model (2PL), whereas items with more than two response categories were modeled with a graded response model (GRM) [57]. We used the item parameters of the general factor (common fatigue dimension) to compute theta values for each possible sum score of each questionnaire [58]. These analyses were done with R Statistics [59] using the R package “mirt” [60].

To judge the accuracy of the links between the six measures, we examined Bland–Altman plots [61] for each pair of questionnaires. Based on the theta values estimated separately for each questionnaire, the figures show the difference between the theta values of two measures (Y axis) as a function of their mean (X axis). The mean is the best estimate of the true value [62], while the difference indicates the bias between both methods for each estimated true value. Each figure contains a straight line indicating the mean of the differences and two dotted lines indicating the 90% limits of agreement (90%-LoA) [63]. The curved line shows a locally fitted nonparametric regression line [64], also called “loess” smoothing. With the help of this line, it is easy to judge visually whether the bias between both methods crosses the limits of agreement, without focusing on individual data points.
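A minimal sketch of the quantities behind such a Bland–Altman comparison, using simulated theta estimates (the bias and noise levels are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative theta estimates from two linked questionnaires for the
# same (simulated) respondents; questionnaire B has a small bias and noise
theta_a = rng.normal(0.0, 1.0, 500)
theta_b = theta_a + rng.normal(0.02, 0.15, 500)

mean_ab = (theta_a + theta_b) / 2      # X axis: best estimate of true value
diff_ab = theta_a - theta_b            # Y axis: bias between the methods

bias = diff_ab.mean()                  # the straight line in the plot
sd = diff_ab.std(ddof=1)
# 90% limits of agreement: bias +/- 1.645 standard deviations
loa_low, loa_high = bias - 1.645 * sd, bias + 1.645 * sd

print(f"bias = {bias:.3f}, 90% LoA = [{loa_low:.3f}, {loa_high:.3f}]")
```

Plotting `diff_ab` against `mean_ab` with horizontal lines at `bias`, `loa_low`, and `loa_high`, plus a loess smoother, reproduces the layout of the figures in Appendix 3.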

Linking the common metric of the patients to the general population

In the resulting IRT model, latent trait estimates (theta) of the calibration sample have a mean of M = 0 and a standard deviation of SD = 1 due to constraints necessary to fully identify the model. To facilitate interpretation, we anchored the common metric to the GP. To this end, we estimated T-scores (M = 50, SD = 10) for each sum score observed in the GP, assuming a normal distribution of the latent trait. We also calculated the theta values for each sum score of the respective questionnaire in the patient sample. For each GP sample, the T-scores (GP) were regressed on the respective theta values (patients) to obtain a formula transforming theta values into GP-based T-scores. For these analyses, we used Microsoft EXCEL 2010.
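This regression step can be sketched as follows; the five theta/T-score pairs are invented near-linear values, not taken from the actual GP data:

```python
import numpy as np

# Illustrative pairs: GP-based T-scores and the patient-metric theta
# values that correspond to the same observed sum scores (made up)
theta_patients = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
t_gp = np.array([43.4, 52.4, 61.3, 70.3, 79.2])

# Linear regression T(GP) = b0 + b1 * theta(patients)
b1, b0 = np.polyfit(theta_patients, t_gp, 1)   # slope first, then intercept
r2 = np.corrcoef(theta_patients, t_gp)[0, 1] ** 2

print(f"T_GP = {b0:.2f} + {b1:.2f} * theta, R^2 = {r2:.3f}")
```

A high R² on such a regression is what supports the linearity assumption discussed in the Limitations section.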

Results

The questionnaire booklet was completed by n = 1225 patients. We estimated missing values for the 50 items from the instruments mentioned above. The proportion of missing values per item ranged from 0.2 to 1.2%. All patients answered more than half of the items, and n = 1118 of the 1225 patients (91.3%) answered all items completely. Altogether, the proportion of missing values was low, i.e., 265 of 61,250 values (0.43%) were imputed, so that almost any technique for dealing with missing values would have led to similar results [43, 65].

Socio-demographic and medical characteristics of the patient sample

Table 1 presents the socio-demographic and medical characteristics of the patient sample. On average, the patients were 55.8 years old (ranging from 18 to 88 years), 18.7% of them were between 18 and 40 years, and 45.9% were 60 years or older. 52.2% of the patients were women, and nearly a third (33.1%) had more than 10 years of education. Three-quarters of the patients (75.6%) answered the questionnaire within the first year of diagnosis. The three most common diagnoses were breast cancer (25.0%), prostate cancer (19.2%), and cancer of the gastrointestinal tract (18.1%).

Table 1 Socio-demographic and medical characteristics of the patient sample (n = 1225)

Common metric of the six fatigue scales

Table 2 presents the fit indices for the unidimensional model and the bifactor model. The unidimensional model that contains only the general factor for all 50 items did not show acceptable fit (CFI = 0.85, SRMR = 0.10). The bifactor model with the three additional specific factors showed acceptable fit (CFI = 0.94, SRMR = 0.06).

Table 2 Model fit (n = 1225)

The average residual correlation was − 0.01, ranging from − 0.16 to 0.18, except for two pairs of items: item 6 and item 7 of the FS11 (r = 0.30) and item 5 and item 6 of the DIG11 (r = 0.36), due to response dependency. Signs of local dependence indicate inflated reliability and problems with construct validity [54]. To estimate the bias caused by local dependence, we additionally tested a bifactor model in which these two pairs of items were parceled, that is, summed up, resulting in a model with 48 items instead of 50. We refrained from deleting either item because this might change the content, convergent, and divergent validity of the instruments.

For the model with 50 items, the general factor explained more than 80% (ECV = 0.81) of the common variance that was explained by all four factors together. The reliability of the general factor was high (COHgeneral = 0.96). The bias in reliability caused by local dependence was negligible: COHgeneral of the model with 48 items decreased by less than 0.003 (ECV decreased by less than 0.004). Likewise, the model fit did not change substantially: CFI (scaled) decreased by less than 0.005 and SRMR increased by less than 0.002. Because there was no substantial bias in these psychometric key features, we used the bifactor model with 50 items for calibration. An inspection of the item fit statistics of this model showed no significant misfit after correcting the significance level of 0.05 for multiple testing (alpha_corrected = 0.001). Item 11 of the FS11 showed the lowest p value, with Chi²(df = 152) = 201.8, p = 0.004. The IRT item parameters for this model are presented in Appendix 1.

To judge the accuracy of the links between the six measures, Table 3 shows the mean of the differences in theta values between each pair, along with the confidence intervals. The mean difference (bias) between two questionnaires ranges from 0.003 (DIG11 and FA3) to 0.037 (BFI9 and FA12). This means that the largest distortion between two linked questionnaires is less than 4% of a standard deviation in theta. This result is reflected in the Bland–Altman plots (see Appendix 3): no smoothed regression curve exceeds the limits of agreement for any pair. The least biased conversions are those where the straight line is near zero and the curved line is nearly horizontal.

Table 3 Mean of differences in theta values between each pair of the questionnaire

Figure 1 shows the standard error of measurement (measurement precision) as a function of the theta value for each instrument within the common metric. In the theta range from − 1 to 1, all questionnaires achieved reliabilities above 0.8. The BFI9 was the most reliable instrument, achieving reliabilities above 0.90 in the broadest theta range, i.e., from − 2 to 3, essentially covering the full expected range of theta. Furthermore, the questionnaires with either few response options (DIG11) or few items (MFI4, FA3) are less reliable than the others.

Fig. 1
figure 1

Measurement precision of the six fatigue scales. SEm standard error of measurement, FA3 sum of the three-item fatigue scale of the EORTC QLQ-C30, FA12 sum of the twelve items of the EORTC QLQ-FA12, MFI4 sum of the four-item General Fatigue scale of the MFI-20, BFI9 sum of the nine items of the BFI, FS11 sum of the eleven items of the FS, DIG11 sum of the eleven binary symptom items of the DIG for Fatigue

Figure 2 visualizes the common metric of the six questionnaires, linked to the GP. The left axis shows the theta values (mean = 0, standard deviation = 1) of the patients. Each data point of the six questionnaires represents a sum score with a mean theta value (see also Appendix 2 with Supplementary Tables S1 to S6). For example, a score of 33 on the three-item fatigue scale FA3 corresponds to a theta value of − 0.81. An above-average burden of fatigue in patients (theta > 0) corresponds to scores of FA3 > 56, FA12 > 36, MFI4 > 13, BFI9 > 35, FS11 > 16, and DIG11 > 6.

Fig. 2
figure 2

Common metric of the six fatigue scales. FA3 sum of the three-item fatigue scale of the EORTC QLQ-C30 (item range = 1–4, scale range = 0–100), FA12 sum of the twelve items of the EORTC QLQ-FA12 (item range = 1–4, scale range = 0–100), MFI4 sum of the four-item General Fatigue scale of the MFI-20 (item range = 1–5, scale range = 4–20), BFI9 sum of the nine items of the BFI (item range = 0–10, scale range = 0–10), FS11 sum of the eleven items of the FS (item range = 0–3, scale range = 0–33), DIG11 sum of the eleven binary symptom items of the DIG for Fatigue (item range: 0/1, scale range = 0–11), T-scores(GP) estimation of T-scores for the German general population (mean = 50, standard deviation = 10)

Linking the common metric of the patients to the GP

The regression analyses for the T-scores of the GP (dependent variable) and the respective theta values of the patients (independent variable) led to the following transformation formula:

$${T\_score}_{GP}=61.34+8.95\times {theta}_{patients}.$$

Based on this relation, we see that an average burden of fatigue in the patient sample (theta(patients) = 0) is more than one standard deviation above the average burden in the GP: T-score(GP) = 61.3.

Discussion

The central result of this study, the common metric of six different questionnaires measuring fatigue, is summarized in Fig. 2 and in Appendix 2, Supplementary Tables S1 to S6. We could show that in the theta range from − 1 to 1, i.e., for approximately two-thirds of the patients, all questionnaires achieved reliabilities above 0.8. For measuring in the medium range of theta, all questionnaires proved sufficiently reliable, while the BFI9 was the most reliable questionnaire over the whole theta range. Questionnaires with few items or few response options proved to be less reliable than the others. Altogether, the reliability coefficients are sufficient to compare the questionnaires. While in classical test theory (CTT) the standard error of measurement, which is used to estimate the reliability, is a constant, in IRT it varies depending on the test information function. Hence, the expected score fluctuations due to error, which are critical to interpretations of individual scores, can vary [66]. This is one of the advantages of IRT over CTT. The theta values appear reasonable, since they range between − 3.28 (FS11) and 3.86 (BFI9); that is, no point deviates more than four standard deviations from the mean. Likewise, the mean (theta = 0) corresponds to scores located in the medium score range of the instruments (FA3 [0–100] = 58.6, FA12 [0–100] = 37.9, MFI4 [4–20] = 13.0, BFI9 [0–10] = 4.0, FS11 [0–33] = 16.7, and DIG11 [0–11] = 6.3).

The data presented in Appendix 2, Supplementary Tables S1 to S6, indicate the position of the scores that actually occur when an individual completes a questionnaire. However, mean scores of groups of individuals generally differ from the points directly indicated in the figure or the supplementary tables. For example, the FA3 mean score of 30.5 obtained in a study with 221 breast cancer patients [67] lies between the scores of 22 and 33, which correspond to theta values of − 1.18 and − 0.81 according to Fig. 2 or Supplementary Table S1. In such cases, we recommend a linear interpolation in the following way:

$$theta\left(30.5\right)=-1.18+\frac{30.5-22}{33-22}\cdot\left(-0.81-\left(-1.18\right)\right)=-1.18+0.77\cdot 0.37=-0.89.$$

This interpolation can also be performed visually with a sufficient degree of accuracy using Fig. 2.
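The recommended interpolation can be written as a small helper (the function name is ours; the numbers reproduce the FA3 example above):

```python
def interpolate_theta(score, s_lo, s_hi, theta_lo, theta_hi):
    """Linearly interpolate the theta value for a (mean) score that falls
    between two tabulated sum scores and their theta values."""
    frac = (score - s_lo) / (s_hi - s_lo)
    return theta_lo + frac * (theta_hi - theta_lo)

# FA3 mean score 30.5 lies between the tabulated scores 22 (theta = -1.18)
# and 33 (theta = -0.81)
print(round(interpolate_theta(30.5, 22, 33, -1.18, -0.81), 2))  # -> -0.89
```

The same helper applies to any of the six questionnaires, given the two neighboring tabulated scores and their theta values from the supplementary tables.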

Another way to illustrate the value of this metric is the comparison of results from studies that used different questionnaires. For example, three studies with breast cancer patients [68,69,70] used three different fatigue questionnaires with the mean scores of BFI9 = 3.04 [68], FA3 = 36.5 [69], and MFI4 = 11.3 [70]. Using Fig. 2, the similarity of the findings can be assessed.

Furthermore, the common metric also allows extrapolating cutoffs across the questionnaires. For an FA3 score greater than 39 (proposed by Giesinger et al. [71] as a threshold for clinical importance), we infer that this value corresponds to theta(patients) ≈ − 0.60:

$$theta\left(39\right)=-0.81+\frac{39-33}{44-33}\cdot\left(-0.43-\left(-0.81\right)\right)=-0.81+0.55\cdot 0.37=-0.60.$$

Hence the FA3 threshold corresponds to scores of FA12 > 22, MFI4 > 9, BFI9 > 2.6, FS11 > 13, and DIG11 > 3.

Finally, we linked the theta values of the patient sample to T-scores of the GP. We presented a formula which can be used to estimate the burden of fatigue in the GP that corresponds to the burden in a comparable patient sample. This allows a straightforward comparison to a meaningful reference.

Limitations

Assuming the multidimensional concept, we modeled a unidimensional and a bifactor model. Only with the bifactor model, which represents a weaker version of this concept, did we find the data to be reasonably unidimensional despite the differences in questionnaire design and background. However, the share of variance attributed to the specific factors was small compared to the general factor. On the basis of this general factor, we developed a common metric. We anchored the common metric to the GP on the basis of four large samples, assuming a linear relationship between the burden of fatigue in the patient population and in the GP. This assumption was supported by the high explained variances of the four regression models, ranging from 97.2 to 99.7%. Although estimating item parameters in the general population samples with subsequent linking would have been a viable way to scale the model, we chose the regression approach because we expected higher levels of fatigue to be underrepresented or even absent in the GP, making item parameter estimation unreliable. For example, the interquartile range of the FA12 differed substantially between patients (22.2 to 58.3) and the GP (0 to 27.8). To avoid presenting four different continua of T-values for the GP, we averaged the regression constants and coefficients to obtain one transformation formula. Although this approach does not reflect the values of the general population for a single questionnaire with the same accuracy, it is easier to apply because it does not require differentiation between questionnaires in the GP.

A particular strength of this study is that we included six different instruments, but the plethora of fatigue items in the questionnaire booklet might also have tired the patients. To mitigate this effect, we placed questionnaires on other topics (quality of life, optimism, self-efficacy, coping, and others) between the fatigue questionnaires.

Although the selection of questionnaires might seem rather arbitrary, we selected fatigue instruments that are frequently used in samples of cancer patients. For the Fatigue Diagnostic Interview Guide (DIG), a detailed psychometric examination of a total scale is still pending; nevertheless, we included this scale in our analysis. From the Multidimensional Fatigue Inventory (MFI-20), we chose one subscale with four items out of consideration for the length of the questionnaire booklet. This subscale, called General Fatigue, is assumed to assess general aspects of fatigue. For the EORTC QLQ-FA12, we used an overall score that includes all 12 items without regard to the respective sub-dimension. This is a consequence of assuming the multidimensional concept, and it was justified by the results of the bifactor model. It is beyond the scope of this study to discuss which conceptualization of fatigue is appropriate; hence, we rather make these critical decisions explicit.

Another concern, not only connected to the issue of dimensionality, is that we did not remove any item from the calibration. The main reason is that a linking using all items reflects the relationships between the questionnaires more realistically than a linking using only matching items; moreover, a scoring in which individual items cannot be used loses its practicability. Furthermore, including all items of each questionnaire allows transforming sum scores onto the common metric even if raw item response data are not available. This is particularly relevant when the data are to be compared with studies that report scores comprising all items of the respective questionnaire. Given these arguments, we believe our approach provides a reasonable trade-off between practicability and conceptual or theoretical grounds.

A further strength of our study is the analysis based on a large sample of cancer patients. Even in the range of severe fatigue, the sample size is sufficient for sound estimates. Existing common metrics for fatigue questionnaires [20,21,22] were based on samples from the general population only; they do not allow comparing results from a sample of cancer patients against the general population. Furthermore, they did not consider more than three questionnaires. For example, Lai et al. [22] included three questionnaires: the Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-Fatigue), the Medical Outcomes Study Short Form (SF-36) Vitality scale, and the Quality of Life in Neurological Disorders (Neuro-QOL) measure.

A limitation of the study is that we did not validate the common metric in an independent sample of patients. It is possible that item parameters differ across samples; therefore, the generalizability of our results to samples with different characteristics is unclear. This holds in particular for samples differing in age, gender, and time since diagnosis, because these characteristics are associated with fatigue [67, 72, 73]. Generalizability may also be limited when comparing samples of patients without cancer or with other diseases involving fatigue, although Cella et al. [1] presented support for the assumption that measures of the Patient-Reported Outcomes Measurement Information System (PROMIS) fatigue item bank may be validly compared across different chronic conditions. Furthermore, the comparison, or more precisely, the conversion of fatigue questionnaires on the basis of our common metric is reasonable only if the multidimensional concept underlies the interpretation.

Conclusions

Our study enables researchers and clinicians to directly compare results across studies that used different fatigue instruments, and it additionally enables the comparison of those results with the metric derived from the general population.