Introduction

Patient-reported outcome measures (PROMs) reflect the perceived impact of a specific clinical condition on individuals and are extensively used to measure health care interventions [1]. A knee-specific PROM should be brief and provide a summary measure of overall knee impact, along with pain, function, and quality of life (QoL) [2]. In order to measure physical function, self-reported measures of function and testing of the execution of a specific task associated with function (performance-based tests) could be used [3]. Additionally, performance-based measures aim at quantifying what patients can actually do; the most relevant functional domains are level walking, stair negotiation, and sit-to-stand movement [3, 4]. On the other hand, PROMs assess patients’ perceptions about their abilities [3].

Knee osteoarthritis (KOA) imposes an enormous burden of illness on the people who suffer from it [5], so valid instruments are needed to measure how patients with KOA perceive that burden. The Knee injury and Osteoarthritis Outcome Score (KOOS) is a knee-specific instrument developed to assess patients’ opinions about their knee and associated problems [6], and it is one of the most widely used PROMs for evaluating patients with KOA. However, completing the 42-item questionnaire places a significant burden on patients, and it is often regarded as too time-consuming for routine clinical use [7].

KOOS-12 is a shortened version of KOOS that was developed using item response theory (IRT) methods, together with the opinions of patients, clinicians, and researchers on its content, clinical importance, and translatability [2]. Evaluation of the psychometric properties of KOOS-12 has shown that it is a valid and reliable instrument for patients with KOA who have had a total knee replacement (TKR) [7, 8]. Psychometric analyses of the Spanish version of the KOOS-12 questionnaire are required to assess whether the scale measures patients’ opinions about their knee and associated problems as intended in Spanish-speaking populations and in populations other than KOA patients with TKR. This study aimed to assess the reliability, construct validity, and responsiveness to change of the Spanish version of the KOOS-12 questionnaire in patients with KOA.

Methods

This study was based on a validation design. Consecutive outpatients in the orthopedic and rheumatology clinics of two secondary care public hospitals were invited to participate. Eligible subjects were those diagnosed with primary KOA according to the American College of Rheumatology criteria [9], with Kellgren–Lawrence (K-L) grade one to four [10]. In cases of bilateral knee involvement, the grade of the worse knee was recorded as the K-L grade. To investigate known-groups validity, individuals without KOA (healthy people) older than 18 years were also invited to participate. Reasons for exclusion were a prior total knee replacement, joint surgery in the preceding six months, another rheumatic disease (e.g., rheumatoid arthritis, psoriatic arthritis, or fibromyalgia), diabetic neuropathy, any known malignancy or major organ failure, neurological diseases, and unwillingness to complete the questionnaire.

The sample size calculation was based on recommendations by experts in this field. For the factor analysis, the sample size should be at least seven times the number of items (i.e., 28 patients per scale), with a minimum of 100 [11]. For the Rasch modeling analysis, recent guidelines indicate that a sample of ≥ 200 patients allows robust estimates of the model parameters [11]. Some authors also consider 200 participants the minimum sample size for estimating stable generalized partial credit model (GPCM) parameters [12,13,14]. Therefore, a minimum sample of 200 participants was required.

Patients who agreed to participate were asked to complete a questionnaire set (paper and pencil format). Following completion of the questionnaires, the functional capacity of each patient was measured by two performance-based tests to assess physical function. The study was in accordance with the ethical standards of the Declaration of Helsinki and was approved by the Ethics Committee of the Instituto Mexicano del Seguro Social (approval date 2019-01-02; approval number R-201-3201-085), and patients signed an informed consent to participate.

Measures

KOOS

The KOOS comprises 42 items in five scales: (1) pain (frequency and severity during functional activities); (2) symptoms; (3) function in daily living (ADL; difficulty experienced during everyday activities); (4) sport and recreational activities (difficulty experienced with sport and recreational activities); and (5) knee-related QoL. Patients respond to each item based on their knee condition over the previous week on a five-point rating scale. The subscales are scored separately; a total score is not recommended. Scores are transformed to a 0–100 scale; higher scores represent better outcomes [15].

KOOS-12 (electronic supplementary material)

The KOOS-12 contains three domain-specific scales that measure pain, function, and knee-specific QoL. At least half of the items in a scale must be answered to calculate a scale score, and a person-specific estimate is imputed for any missing item data. Scores are then transformed to a 0 to 100 scale, where 0 is the worst and 100 is the best possible score. The KOOS-12 summary knee impact score is calculated as the average of the three scale scores. A summary score is not calculated if any of the three scale scores is missing [2].
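The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the official scoring algorithm: it assumes that items are coded from 0 (best) to 4 (worst) and that a missing item is imputed with the respondent's mean over the answered items of the same scale, which may differ from the person-specific estimate used in the published scoring instructions. The function names are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence

MAX_ITEM_SCORE = 4  # assumption: items are coded 0 (best) to 4 (worst)


def scale_score(items: Sequence[Optional[int]]) -> Optional[float]:
    """Return a 0-100 scale score (100 = best), or None if it cannot be scored."""
    answered = [x for x in items if x is not None]
    # At least half of the items in the scale must be answered.
    if len(answered) * 2 < len(items):
        return None
    # Assumption: missing items take the person's mean of the answered items,
    # which is the same as averaging the answered items only.
    return 100.0 - mean(answered) * 100.0 / MAX_ITEM_SCORE


def summary_score(pain, function, qol) -> Optional[float]:
    """Average the three scale scores; None if any scale score is missing."""
    scores = [scale_score(pain), scale_score(function), scale_score(qol)]
    if any(s is None for s in scores):
        return None
    return mean(scores)
```

For example, under these assumptions a respondent who answers 2 to two pain items and skips the other two still receives a pain score of 50, whereas a respondent who answers only one of four items receives no score for that scale and therefore no summary score.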

Knee intermittent and constant osteoarthritis pain

The Intermittent and Constant Osteoarthritis Pain (ICOAP) measure comprises 11 items: five evaluate constant pain and six evaluate intermittent pain. Patients respond to each item based on their knee condition over the previous week on a five-point rating scale. Total scores are created by adding up item scores and normalizing from 0 (no pain) to 100 (extreme pain); higher scores represent worse outcomes [16].

International knee documentation committee subjective knee evaluation form

The International Knee Documentation Committee (IKDC) subjective knee evaluation form comprises 18 items in the domains of symptoms, functioning during activities of daily living and sports, current function of the knee, and participation in work and sports. The total score is calculated as (sum of items)/(maximum possible score) × 100. Possible scores range from 0 to 100, where 100 means no limitation in daily or sporting activities and the absence of symptoms; lower scores represent worse outcomes [4].

World Health Organization Disability Assessment Schedule 2.0

The World Health Organization Disability Assessment Schedule 2.0 (WHODAS 2.0) measures people’s activity limitations and participation restrictions according to the constructs included in the International Classification of Functioning, Disability, and Health (ICF). The WHODAS 2.0 cognition, mobility, and ADL subscales are self-administered questionnaires with six, five, and eight items, respectively. Subscale scores are created by summing item scores and normalizing from 0 (no disability) to 100 (total disability) [17].

Performance-based measures

The 30-s Chair Stand Test (30-s CST) and the Timed Up and Go test (TUG) were applied using a folding chair without arms and with a seat height of 43 cm. The 30-s CST is a performance-based measure that evaluates the activity “sit-to-stand movement.” It is scored as the maximum number of complete chair stand movements performed in 30 s [18]. The TUG records the time (in seconds) taken to rise from a chair, walk 3 m, turn, walk back to the chair, and sit down, wearing regular footwear and using a walking aid if required [18].

Radiographic severity of knee osteoarthritis

Bilateral weight-bearing anteroposterior and lateral semi-flexed radiographs were recorded for both knees in each subject. They were radiologically graded according to the Kellgren–Lawrence classification [10]. The radiographs were evaluated blindly by an experienced rheumatologist (GHB).

Reliability assessment

The internal consistency of the scale was evaluated using the Cronbach’s alpha coefficient, McDonald’s omega coefficient, and the IRT reliability coefficient. Interpretation of the IRT reliability coefficient is similar to Cronbach’s alpha; a value between 0.80 and 0.95 was considered good internal consistency [19, 20]. For the test-retest reliability evaluation, all patients were invited to a second evaluation. The clinical assessment was repeated by the same physician 14 days after the first assessment at the same study site. The test-retest reliability was calculated by using an intra-class correlation coefficient (ICC, two-way mixed-effect ANOVA model with interaction for the absolute agreement between single scores). An ICC > 0.70 was considered adequate [20]. Measurement error was obtained by evaluating the standard error of measurement (SEM), smallest detectable change (SDC), and Bland-Altman limits of agreement [20].
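The measurement error indices named above can be illustrated with a short Python sketch. It assumes the commonly used relations SEM = SD × √(1 − ICC) and SDC = 1.96 × √2 × SEM, and Bland-Altman limits of agreement computed as the mean difference ± 1.96 × SD of the differences. The source does not state which SEM formula was applied, so these choices, like the function names, are assumptions.

```python
import math
from statistics import mean, stdev


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from a score SD and an ICC (assumed formula)."""
    return sd * math.sqrt(1.0 - icc)


def sdc_from_sem(sem: float) -> float:
    """Smallest detectable change at the individual level (95% confidence)."""
    return 1.96 * math.sqrt(2.0) * sem


def bland_altman_limits(test, retest):
    """Return (lower, upper) 95% limits of agreement for paired scores."""
    diffs = [second - first for first, second in zip(test, retest)]
    bias = mean(diffs)          # mean test-retest difference
    sd_diff = stdev(diffs)      # sample SD of the differences
    return bias - 1.96 * sd_diff, bias + 1.96 * sd_diff
```

Under these formulas, an SEM of 13.19 points corresponds to an SDC of about 36.56 points, so an individual change smaller than that could not be distinguished from measurement error.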

Validity assessment

Construct validity was evaluated through structural validity, relationships with scores of other instruments, and differences between relevant groups [21]. Structural validity was assessed by confirmatory factor analysis (CFA). The diagonally weighted least squares (DWLS) estimation method with polychoric correlations was used to estimate the factorial model parameters [22]. Factor loadings > 0.70 were considered desirable [21]. The goodness of fit of the model was analyzed with the chi-square test, for which a p value ≥ 0.05 indicates that the proposed model fits the data. Other indicators of good fit are a root mean square error of approximation (RMSEA) < 0.06, a standardized root mean square residual (SRMR) ≤ 0.08, a comparative fit index (CFI) ≥ 0.95, and a Tucker-Lewis index (TLI) ≥ 0.95 [22, 23]. An RMSEA < 0.08 indicates an adequate fit, while an RMSEA > 0.1 is considered a poor fit [24].
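To make the fit criteria concrete, the following Python sketch applies the cutoffs listed above. The RMSEA formula shown is the usual unscaled one, √(max(χ² − df, 0) / (df × (N − 1))); software may report a scaled or robust variant under DWLS estimation, so the values it returns need not match those printed by a given package. The label for RMSEA values between 0.08 and 0.10, which the text leaves undefined, and the function names are assumptions.

```python
import math


def rmsea(chi_square: float, df: int, n: int) -> float:
    """Unscaled RMSEA from a model chi-square, its degrees of freedom, and sample size."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


def classify_rmsea(value: float) -> str:
    """Label an RMSEA value using the cutoffs quoted in the text."""
    if value < 0.06:
        return "good"
    if value < 0.08:
        return "adequate"
    if value > 0.10:
        return "poor"
    return "unclassified"  # assumption: the text gives no label for 0.08 to 0.10


def meets_good_fit(rmsea_value: float, srmr: float, cfi: float, tli: float) -> bool:
    """True when all four indices satisfy the 'good fit' cutoffs."""
    return rmsea_value < 0.06 and srmr <= 0.08 and cfi >= 0.95 and tli >= 0.95
```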

Construct validity was considered adequate if expected correlations were found with existing measures assessing similar (convergent validity) and different (divergent validity) constructs [4]. To establish convergent validity, some hypotheses were tested:

  1. KOOS-12 pain should present a positive and strong correlation (r > 0.7) with KOOS pain, ADL, and QoL [8], and negative and moderate correlation (r > − 0.6) with ICOAP [25,26,27].

  2. KOOS-12 function should present a positive and strong correlation (r > 0.7) with KOOS ADL [8], and moderate (0.4 < r < 0.7) with KOOS sport, TUG, and 30-s CST.

  3. KOOS-12 summary should present a strong correlation (r > 0.7) with KOOS pain, ADL, and QoL, and moderate to strong (r > 0.4) with TUG and 30-s CST.

  4. IKDC score would show a positive and moderate correlation with the KOOS-12 pain and QoL, and strong correlation with KOOS-12 function and summary.

  5. Correlation between KOOS-12 function and WHODAS 2.0 mobility and activities of daily living subscales should be moderate to strong (0.4 < r < 0.8).

To establish the divergent validity, the following hypotheses were tested:

  1. KOOS-12 pain scale should present a positive and moderate correlation (0.4 < r < 0.69) with KOOS symptoms and sport [8].

  2. KOOS-12 QoL should show a moderate correlation with KOOS pain, symptoms, and sport.

  3. Correlation between KOOS-12 function scale and WHODAS 2.0 cognitive disability scale should be low (r < 0.4).
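The hypotheses above rest on three correlation bands: low (below 0.4), moderate (0.4 to 0.7), and strong (above 0.7). The Python sketch below shows one way to compute Spearman's rho with average ranks for ties and to label its magnitude with those bands. It is an illustration only: the handling of values exactly equal to 0.4 or 0.7 is an assumption, since the text does not specify it, and the function names are hypothetical.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def strength(r: float) -> str:
    """Label the magnitude of a correlation using the bands in the hypotheses."""
    size = abs(r)
    if size > 0.7:
        return "strong"
    if size > 0.4:
        return "moderate"
    return "low"  # assumption: exactly 0.4 is treated as low
```

Because the label depends on the absolute value, a hypothesis that also specifies a direction, such as the negative correlation expected with ICOAP, would need a separate check of the sign.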

Known-groups validity was assessed by testing a priori hypotheses about subgroups expected to differ significantly in mean KOOS-12 scores. The hypotheses were formulated as follows: (1) KOOS-12 scale and summary scores would be significantly higher in healthy people than in patients with KOA; (2) KOOS-12 scale and summary scores would be significantly higher in patients with KOA grade 1 than in patients with grades 3 and 4; and (3) the KOOS-12 summary score would be significantly lower in obese patients.

Item response theory analysis

The G2-LD index and Q3 index were used to evaluate the local independence assumption of items [28, 29]. The partial credit model (PCM) and generalized partial credit model (GPCM) were used to obtain item and person parameters using marginal maximum likelihood estimation with the expectation-maximization (MML-EM) algorithm. Both models were fitted to the data, and their overall fits were compared using the likelihood ratio test (LRT). The LRT assesses whether the model with unrestricted values for the discrimination parameter is necessary to improve the model’s fit. In the PCM (an extension of the Rasch model), all item response functions have the same discrimination parameter (α). In the GPCM, the discrimination parameter is allowed to vary across items. Difficulty parameters (β-parameters) were interpreted in standard deviation units of the latent trait, showing the range of the latent trait covered by the item. The higher the β-parameter, the higher the trait level a respondent needs to endorse that response option [30]. The discrimination parameter measures the strength of the relationship between the item and the latent trait being measured [30]. Item fit was assessed using S-X², and misfit was indicated by significant results with a Benjamini and Hochberg adjusted overall alpha level of 0.05 [31]. The overall model fit was analyzed using limited information statistics (M2), along with the associated RMSEA and SRMR indices [32]. A p value > 0.05 for the M2, RMSEA < 0.05, and SRMR < 0.027 indicate an excellent fit, while RMSEA < 0.089 and SRMR < 0.05 are considered an adequate fit [32]. Differential item functioning (DIF) was investigated for age (< 64 years, 64–71 years, and > 71 years), sex (male vs. female), and education level (< 10 years vs. ≥ 10 years). DIF was declared present if significant differences in model fit between non-DIF and DIF models were observed [15]. Person fit was examined by using the standardized statistic Zh (Drasgow, Levine, and Williams) [33]. Person-fit statistics compare a person’s observed and expected item scores across test items. Zh values above 2 or below − 2 reflect participants with “atypical” or “inconsistent” response patterns [33]. Large negative Zh values indicate non-fitting response patterns given the model and the trait value.
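The role of the discrimination and threshold parameters can be illustrated by computing GPCM category probabilities for a single item. The Python sketch below uses the standard GPCM form, in which the probability of category k is proportional to exp(Σ α(θ − β_v)) summed over the first k thresholds; the PCM is the special case in which α is the same for all items. It is not the estimation routine of any package, and the function name is hypothetical.

```python
import math


def gpcm_probabilities(theta: float, discrimination: float, thresholds):
    """Return P(X = k | theta) for k = 0..len(thresholds) under the GPCM."""
    # Category 0 has an empty sum (0); category k adds a * (theta - b_k).
    exponents = [0.0]
    for b in thresholds:
        exponents.append(exponents[-1] + discrimination * (theta - b))
    peak = max(exponents)  # subtracted before exponentiating for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]
```

With ordered thresholds, a respondent far below the first threshold is most likely to choose the lowest category and a respondent far above the last threshold the highest one. When adjacent thresholds are reversed, the intermediate category between them is never the most probable response at any trait level.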

Floor and ceiling effects

Floor or ceiling effects were considered present if more than 15% of respondents achieved the lowest or highest possible score [20].
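The 15% criterion can be expressed as a simple check. The Python sketch below assumes scores on the 0 to 100 scale, with 0 and 100 as the lowest and highest possible scores; the function name and the returned keys are hypothetical.

```python
def floor_ceiling(scores, lowest=0.0, highest=100.0, threshold=0.15):
    """Return the share of respondents at each extreme and whether it exceeds the threshold."""
    n = len(scores)
    floor_share = sum(1 for s in scores if s == lowest) / n
    ceiling_share = sum(1 for s in scores if s == highest) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,      # more than 15% at the lowest score
        "ceiling_effect": ceiling_share > threshold,  # more than 15% at the highest score
    }
```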

Responsiveness assessment

Of the 199 patients enrolled at the beginning of the study, 38 received intra-articular treatment (Hylan-GF 20, collagen-PVP, or glucocorticoids) and were included in the responsiveness assessment, which was performed 2 months after the first dose of the treatment. Three methods were used to evaluate responsiveness: the standardized response mean (SRM), effect sizes, and hypothesis testing. For the interpretation of the SRM, cutoff points of 0.20, 0.50, and 0.80 were established to indicate low, moderate, and high sensitivity to change, respectively [34]. Hypothesis testing assessed whether the changes in pain intensity measured by KOOS-12 subscales and summary were correlated (r ≥ 0.70) with changes measured by KOOS subscales, IKDC, and ICOAP.
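The two distribution-based indices can be sketched as follows. The SRM is taken as the mean change divided by the SD of the change scores, and the effect size as the mean change divided by the SD of the baseline scores; the latter is a common definition but the source does not spell it out, so it is an assumption. The label for values below 0.20 and the function names are also assumptions.

```python
from statistics import mean, stdev


def change_scores(baseline, follow_up):
    """Paired change scores (follow-up minus baseline)."""
    return [after - before for before, after in zip(baseline, follow_up)]


def standardized_response_mean(baseline, follow_up) -> float:
    """Mean change divided by the SD of the change scores."""
    changes = change_scores(baseline, follow_up)
    return mean(changes) / stdev(changes)


def effect_size(baseline, follow_up) -> float:
    """Mean change divided by the SD of the baseline scores (assumed definition)."""
    return mean(change_scores(baseline, follow_up)) / stdev(baseline)


def interpret_srm(srm: float) -> str:
    """Label an SRM using the 0.20, 0.50, and 0.80 cutoffs."""
    size = abs(srm)
    if size >= 0.80:
        return "high"
    if size >= 0.50:
        return "moderate"
    if size >= 0.20:
        return "low"
    return "negligible"  # assumption: the text gives no label below 0.20
```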

Statistical analysis

Results are presented as n (%) for categorical variables and as mean and standard deviation (mean ± SD) or median (interquartile range) for continuous variables, as appropriate. To evaluate differences between two groups, the t test was used, and effect size was reported as Cohen’s d, with 0.2, 0.5, and 0.8 as thresholds for small, medium, and large effects, respectively [35]. One-way ANOVA with multiple comparisons was conducted using the Bonferroni test to discern differences between groups [36]. Effect size was reported as eta squared, with 0.01, 0.06, and 0.14 as thresholds for small, medium, and large effects, respectively [35]. The strength and direction of the relationship between two variables were evaluated using Pearson’s correlation coefficient (r) if both variables were measured on an interval scale and normally distributed, and otherwise using Spearman’s correlation coefficient (rho). Statistical analysis was performed using the R statistical program (2020, R Core Team, Vienna, Austria). CFA was performed with the lavaan package, the parameters of the GPCM were estimated with the mirt package, and DIF analyses using proportional odds cumulative logistic models were run in the lordif package.
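The two effect size measures mentioned above can be computed as in the following Python sketch. Cohen's d is calculated with the pooled standard deviation of the two groups, and eta squared as the between-group sum of squares divided by the total sum of squares. The analyses in this study were run in R; this sketch only illustrates the formulas, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b) -> float:
    """Cohen's d for two independent groups, using the pooled SD."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def eta_squared(groups) -> float:
    """Eta squared for a one-way layout: SS between groups / SS total."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)
    ss_total = sum((value - grand_mean) ** 2 for value in all_values)
    return ss_between / ss_total
```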

Results

A total of 199 patients with KOA and ten healthy people participated in this study. One hundred and sixteen patients were re-evaluated for reliability testing. The mean age of the participants in the validation sample (n = 209) was 63 years (minimum 34, maximum 90 years), and 78.95% (n = 165) were women. The median scores on KOOS-12 pain, function, quality of life, and summary were 43.75, 37.5, 31.25, and 37.5, respectively. Rates of missing data were low (< 2%). The characteristics of all included patients at baseline, of the test-retest sample, and of the responsiveness sample are presented in Table 1.

Table 1 Clinical and demographic characteristics of participants

Reliability

KOOS-12 showed appropriate internal consistency. Cronbach’s alpha was 0.87, 0.91, 0.85, and 0.94 for KOOS-12 pain, function, QoL, and summary, respectively. The omega coefficient was 0.87, 0.90, and 0.86 for KOOS-12 pain, function, and QoL, respectively. The ICC of KOOS-12 was 0.63, 0.60, 0.71, and 0.71 for the pain, function, QoL, and summary, respectively. Standard error of measurement values ranged between 9.38 and 13.19. The smallest detectable change ranged from 28.32 to 36.56 (Table 2).

Table 2 Test-retest reliability and responsiveness to change of the KOOS-12 and KOOS questionnaires

Structural validity

Three separate analyses were carried out, one each for the pain, function, and QoL scales (Table 3). All items loaded strongly (factor loadings > 0.7) onto their respective factors. CFA showed that the one-factor models for KOOS-12 function and QoL had a good model fit. The one-factor KOOS-12 pain model had an adequate model fit.

Table 3 Results from classical item analysis and unidimensionality analysis of the KOOS-12 questionnaire. Factor loadings (standard error) and goodness of fit indices from confirmatory factor analysis

Item response theory analyses

IRT analyses were conducted separately for the KOOS-12 pain, function, and QoL scales. For all scales, the PCM (Rasch-based) was tested against the GPCM, and the GPCM fit better (Supplementary Table S1). The item parameter estimates from the GPCM calibrations of the KOOS-12 subscales are reported in Table 4. All items in the three scales showed good item-level fit. No local dependence between items was detected. No DIF was found by sex, age, or education level. The function and QoL scales had a good overall model fit, and the category thresholds of all their items were correctly ordered. Person-fit statistics detected < 4% of persons with “atypical” response patterns. Within the IRT framework, all scales yielded appropriate reliability (Table 4).

Table 4 Estimated slope, location, and threshold parameters, model fit, and reliability coefficient for the KOOS-12

KOOS-12 pain scale

Item 1 “Frequency knee pain” had the lowest discrimination parameter, indicating that this item did not discriminate between respondents as well as the other items. The four items of the pain scale covered a wide range of difficulties, from − 1.61 to 1.69. Item 4 “Pain sitting or lying” was the most difficult item on the pain scale; that is, high levels of pain severity are required for the patient to have a higher probability of selecting the last response category, “extreme.” In contrast, the least difficult item was item 3 “Pain up/downstairs”; that is, very low levels of knee pain severity are required for the patient to have a higher probability of selecting the response category “None.” Conversely, category 4 “extreme” is the most probable response for patients with higher knee pain severity, and its probability decreases as knee pain severity decreases.

Item 2 “Pain walking on flat” in KOOS-12 pain scale provided more information than the other items. The overall fit of the model for the pain subscale was acceptable (M2 = 0.01, RMSEA = 0.12, and SRMR < 0.05; Table 3). The reliability coefficient for KOOS-12 pain was 0.90, yielding appropriate reliability.

The scale scores obtained with the summated and transformed scoring method (0 to 100) and with IRT-based scoring (theta, or latent trait) were very strongly correlated. The Pearson correlation coefficient between the scores obtained by these two methods was − 0.983 (95% CI − 0.987 to − 0.978), − 0.982 (95% CI − 0.986 to − 0.976), and − 0.974 (95% CI − 0.980 to − 0.966) for pain, function, and QoL, respectively (Supplementary Figure). Therefore, results are presented on the 0 to 100 scale.

Convergent validity, divergent validity, and validity of known groups

Seventy-nine percent of the hypotheses raised for the evaluation of convergent validity were verified. Similarly, 83% of the hypotheses for the assessment of divergent validity were confirmed. The KOOS-12 pain scale showed very strong correlations (rho ≥ 0.79) with the KOOS pain and ADL scales and the ICOAP scale. The KOOS-12 function scale showed very strong correlations (rho ≥ 0.80) with the KOOS pain, ADL, and sport scales, the IKDC, and the ICOAP scale. KOOS-12 QoL was strongly correlated with the KOOS ADL and sport scales and the IKDC. The KOOS-12 summary score presented very strong correlations with the KOOS pain, ADL, and QoL scales, the IKDC, and the ICOAP scale. The KOOS-12 scales and summary score were moderately correlated with TUG and the 30-s CST (0.46 < rho < 0.55; Table 5).

Table 5 Construct validity of KOOS-12 scales and summary. Spearman’s rho correlation coefficients (95% confidence interval) among KOOS-12, disease-specific measures, and performance-based measures

Digital radiographs were available for 172 patients to assess known-groups validity. As hypothesized, patients with higher K-L grades reported more severe pain. Post hoc analysis demonstrated significantly lower KOOS-12 scale and summary scores in subjects with KOA vs. healthy controls (p < 0.01), grade 4 vs. grade 1 (p < 0.01), grade 4 vs. grade 2 (p < 0.01), and grade 4 vs. grade 3 (p < 0.05). No other significant differences were found. KOOS-12 pain, function, QoL, and summary scores were significantly higher in non-obese than in obese patients. Known-groups validity results were very similar whether the groups were compared using the summed scores or the theta levels (Table 6).

Table 6 ANOVA results for estimated KOOS-12 scales and summary by the Kellgren–Lawrence classification and nutritional status using summated and transformed scores, and item response theory (IRT) scores

Floor and ceiling effects

None of the KOOS-12 scales presented a floor or ceiling effect. In patients with KOA, 2.01% (n = 4) had the highest score (best outcome) on KOOS-12 scales and summary. Similarly, 2.01% (n = 4) presented the lowest score (worst result) on KOOS-12 scales and in KOOS-12 summary.

Responsiveness

Responsiveness assessment indicated that KOOS and KOOS-12 were sensitive to change. There were significant improvements in KOOS-12 scores eight weeks after intra-articular treatment. The KOOS-12 summary score had the highest effect size of all scales (Table 2). SRMs for KOOS-12 scores ranged from 0.75 to 0.94 (Table 2). The SRM for the KOOS-12 summary score was higher than those of the three KOOS-12 scales and the three KOOS scales evaluated. The KOOS-12 summary score had strong (r ≥ 0.76) and significant correlations with the KOOS pain and ADL scales, ICOAP score, and IKDC score, and had no significant correlation with the Timed Up and Go test or the 30-s CST (Supplementary Table S2).

Discussion

KOOS-12 is a short self-reported measure that assesses patients’ opinions about the difficulties they experience due to problems with their knee, covering pain, functional limitations, and knee-related QoL [8]. There are currently three versions of KOOS in Spanish (adapted for Spain, Peru, and the United States). Instead of producing another translation, this study evaluated the psychometric properties of KOOS-12 using the Spanish version for Peru. Some patients needed help with some response options, so minor modifications were made to improve their comprehensibility. Some patients had difficulty understanding the difference between “daily” and “always”; therefore, the clarification “Una vez al día” (once a day) was added to the “daily” option. Similarly, the clarifications “Severo/Fuerte” (severe/strong) and “Muy severo/Extremo” (very severe/extreme) were added to the response options “severe” and “very severe,” respectively.

The Spanish version of the KOOS-12 questionnaire shows appropriate internal consistency reliability for evaluating patients’ knee problems [9]. This is consistent with previous findings, in which Cronbach’s alpha was 0.75–0.82, 0.78–0.82, 0.80–0.84, and 0.90–0.93 for the KOOS-12 pain, function, QoL, and summary, respectively [2]. The test-retest reliability of the KOOS-12 pain and function scales was moderate (ICC < 0.7). No previous study has reported the test-retest reliability of KOOS-12.

Of the KOOS-12 scales, the pain scale was the only one that did not show a good overall fit. The overall model misfit may be related to reversed category boundaries, whereby the response option “monthly” is never the most probable response for patients at any point on the trait scale. Low response frequencies could explain the reversed thresholds in the response options of item 1: the second response option (“monthly”) was chosen by 7.18% of respondents and the third (“weekly”) by 11.96%, far fewer than the 46.89% who chose the fourth option (“daily”). The low frequency of these categories may be due to the small number of patients with early knee osteoarthritis. Therefore, evaluating the pain scale in a sample with more patients with KOA grade 1 (Kellgren–Lawrence) could be appropriate.

The correlation between KOOS-12 pain and KOOS pain was strong, indicating that the variance in the KOOS pain scale was sufficiently captured by the four items of the KOOS-12 pain scale. These results agree with the correlation coefficients reported in two previous studies (r = 0.89–0.93) [2, 7]. The present study showed a very strong correlation (r = 0.94) between KOOS-12 function and KOOS ADL, which is consistent with previously reported data (r = 0.81–0.90) [2, 7]. We found a higher correlation between KOOS-12 function and KOOS sports and recreational activities (r = 0.80) than previous studies (r = 0.61–0.71) [2, 7]. The discriminant validity of the KOOS-12 was demonstrated by the low correlation between the KOOS-12 scales and the WHODAS 2.0 cognitive and ADL scales. A low correlation of KOOS-12 with the mental health scale of the SF-36® instrument has been demonstrated previously [2]. The KOOS-12 summary score shows evidence of construct validity. In this study, the KOOS-12 summary score was highly correlated with the KOOS, IKDC, and ICOAP. Similarly, Eckhard et al. [7] reported correlations of the KOOS-12 summary score with KOOS, WOMAC®, and Oxford-12.

Dobson et al. [37] have reported that the sit-to-stand tests with the best measurement evidence for KOA included the TUG test and the 30-s CST. In this study, the KOOS-12 scales showed a moderate correlation with the TUG test and the 30-s CST. Our results are consistent with previous reports: the TUG test is negatively and moderately correlated with all the KOOS scales (r = − 0.66 to − 0.45) and with the Lequesne index [38, 39], and presents weak correlations with the WOMAC® (r < 0.3) [39, 40]. To the best of our knowledge, no previous study has correlated KOOS or KOOS-12 with the 30-s CST. However, the available information suggests a low-to-moderate relationship between self-reported and performance-based measures. Moreover, performance-based measures and self-reported PROMs assess different patient characteristics.

KOOS-12 demonstrated moderate to large responsiveness eight weeks after intra-articular therapy. Furthermore, the KOOS-12 scales and summary score performed as well as the KOOS pain, symptoms, and ADL scales and the IKDC in terms of responsiveness. To the best of our knowledge, this is the first study examining the responsiveness of KOOS-12 in patients with KOA treated with intra-articular therapy. In patients with TKR, the SRMs for KOOS-12 ranged from 1.62 to 2.12, and the quality of life scale was the most sensitive to change [2]. In contrast, in patients with intra-articular treatment, the pain and function scales were the most sensitive to change. The present study shows that KOOS-12 pain and summary were more sensitive to change than the KOOS scales. In line with these results, high effect sizes and standardized response means have been reported for the KOOS-12 summary score post-TKR [7, 8].

KOOS-12 is an easily accessible instrument for clinicians: it is freely available and easy to understand and score, and the number of missing values is low. In clinical settings, KOOS-12 is a brief, comprehensive knee-specific PROM with good psychometric properties that provides an overall knee impact score along with domain-specific measures.

Our study has some limitations that must be acknowledged. First, participants were recruited through secondary care clinics, so our sample may not be representative of the KOA population. Second, the sample was not balanced across OA severity; fewer patients had mild or severe KOA. Third, even though the GPCM fitted the data well, the total sample size could be considered “inadequate” for accurate parameter estimates [11]. However, evidence from recent simulation studies suggests that a sample size of 200 participants is enough to achieve a robust CFA solution. Monte Carlo data simulation techniques showed that an adequate sample size for a one-factor CFA with four items and factor loadings of 0.65 (as in each scale of the KOOS-12) could be as low as 90 patients [41]. Concerning robust weighted least squares (WLS) estimation, the relative bias in the estimated standard errors of factor loadings depended on sample size and factor loading magnitude; with a sample size of 200 participants and five-category data, the relative bias for a four-indicator model was < 5% with loadings of 0.70 [42]. Although our study’s sample size is considered “very good” for estimating the parameters of the Rasch model [11], our results showed that the KOOS-12 scales had a poor fit to the Rasch-based partial credit model. In addition, the same discrimination parameter could not be assumed for all items. The sample size of 209 patients might also be a limitation, as GPCM analysis generally requires ≥ 500 patients due to the number of parameters to be estimated [11]. These issues will be important to address in future research.

In conclusion, this study demonstrates that the Spanish version of the KOOS-12 questionnaire is a valid, reliable, and responsive instrument for measuring patients’ opinions about their knee and associated problems in Mexican subjects with KOA.