Background

Medication Treatment Satisfaction (M-TS) is a subjective patient-reported outcome (PRO) that evaluates patients’ perception of the medication-taking process and its associated outcomes [1]. Both the US Food and Drug Administration (FDA) [2] and the European Medicines Agency (EMA) [3] encourage including patient-reported satisfaction with treatment in drug development and evaluation, with an emphasis on patients’ judgment.

M-TS, if captured in a scientifically rigorous way [3, 4], can predict adherence (e.g., by identifying areas where patients are dissatisfied with their medication) [5], inform clinical decision-making (e.g., by allowing health care professionals to select therapies based on patient feedback) [3, 6,7,8], and influence health care policy (e.g., by guiding reimbursement decisions and quality improvement initiatives based on patient-centered outcomes) [3, 6,7,8]. Patient-reported outcome measures (PROMs) with poor validity, reliability, or responsiveness in the target population may inadequately capture changes in M-TS [9, 10], resulting in inaccurate estimates of drug effects and misguided clinical decisions [9,10,11]. With the growing number of available PROMs for M-TS [12,13,14,15], heterogeneity in outcome reporting has stifled efforts to synthesize findings across trials [16, 17].

Identifying valid, reliable, responsive, and interpretable PROMs for M-TS is therefore crucial for clinical trials and clinical practice [7, 8, 10]. To date, no study has systematically reviewed existing PROMs for M-TS and assessed their measurement properties. Our systematic review aims to identify currently available PROMs for M-TS, to evaluate the measurement properties of these PROMs, to provide evidence for choosing a PROM for M-TS, and to highlight any research gaps regarding the development and validation of PROMs for M-TS.

Methods

We registered this systematic review on the Open Science Framework (https://doi.org/10.17605/OSF.IO/8S5ZM) and adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) reporting guideline [18] and the Consensus-based Standards for the selection of health Measurement Instruments (COSMIN) guideline for systematic reviews of patient-reported outcome measures [19,20,21]. The COSMIN guideline provides a standardized data abstraction form for characteristics and measurement properties of the included PROMs, and criteria for assessing the measurement properties of PROMs. The reviewers used the COSMIN Risk of Bias checklist to assess the risk of bias of individual studies, and the modified Grading of Recommendations Assessment, Development, and Evaluation (GRADE) to grade the quality of evidence.

The research team formation

We established a multidisciplinary research team, comprising two PROM methodologists, one epidemiology methodologist, three clinicians, three pharmacists, and two pharmacy students, to ensure comprehensive analysis and diverse perspectives in this review.

Literature search and selection

The reviewers systematically searched PubMed, Embase (Ovid), Cochrane Library (Ovid), International Pharmaceutical Abstracts (IPA, Ovid), PsycINFO, Patient-Reported Outcome and Quality of Life Questionnaires biomedical databases (PROQOLID), China National Knowledge Infrastructure (CNKI), Wanfang, Chinese Scientific Journal Database (VIP database), and Chinese Biomedicine Literature Database (CBM) (from inception through 5 December 2022) and ePROVIDE (https://eprovide.mapi-trust.org/) for literature published in English or Chinese reporting the development (i.e., development study including cognitive interview or other pilot study) or validation (i.e., validation study) of PROMs for M-TS in adults and children with any medical condition (Additional file 1: table S1).

After removing duplicate records, pairs of reviewers (M.Y., X.J., W.Y., and S.Z.) independently screened the titles and, subsequently, the full-text articles, with conflicts resolved by a fifth reviewer (L.Z.). To identify additional relevant literature, the reviewers screened the reference lists of included articles.

Data extraction

Using a pilot-tested data abstraction form, after a calibration exercise, pairs of reviewers (M.Y., X.J., W.Y., and S.Z.) independently extracted data including the characteristics of the individual studies (e.g., study design, country, sample size, study population), the characteristics of the PROMs (e.g., target population, domains, response options, and copyright based on ePROVIDE), the measurement properties of the PROMs (i.e., content validity, structural validity, construct validity, criterion validity, cross-cultural validity or measurement invariance, internal consistency, test–retest reliability, measurement error, and responsiveness), and information about the interpretability of the PROMs [19, 22].

For the PROM development, we extracted the origin of the construct to be measured reported in the study (e.g., a theory, conceptual framework, or disease model used, or a clear rationale provided to define the construct). Then, after group discussion within our research team, we summarized the concepts, components, and influencing factors of treatment satisfaction into a table.

Assessment of measurement properties

Using the criteria for good measurement properties [19, 20], two reviewers (M.Y. and P.Z.) independently assessed the measurement properties of each PROM in each individual study as sufficient (+), insufficient (−), or indeterminate (?) (Additional file 1: table S2 [19, 20, 22]). For example, we rated the test–retest reliability of a PROM as sufficient if the intraclass correlation coefficient (ICC) or weighted Kappa was ≥ 0.70, as insufficient if it was < 0.70, and as indeterminate if neither ICC nor weighted Kappa was reported. Based on all studies relevant to a particular PROM, the reviewers assessed the overall measurement properties of that PROM as sufficient (+), insufficient (−), inconsistent (±), or indeterminate (?) [19, 20]. A third reviewer (L.Z.) resolved any disagreement.
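As an illustration only, the rating logic described above can be sketched in a few lines of Python. The sketch assumes the 0.70 threshold stated in the text; the function names (`rate_test_retest`, `pool_ratings`), the example coefficients, and the pooling rule are hypothetical simplifications and do not reproduce the full COSMIN procedure (for instance, reviewers may explore explanations for inconsistent results before assigning ±).

```python
from typing import Optional, Sequence

# Rating symbols used in the text ("-" stands in for the minus sign).
SUFFICIENT, INSUFFICIENT, INCONSISTENT, INDETERMINATE = "+", "-", "±", "?"


def rate_test_retest(coefficient: Optional[float], threshold: float = 0.70) -> str:
    """Rate one study's test-retest reliability from its ICC or weighted Kappa."""
    if coefficient is None:  # neither ICC nor weighted Kappa was reported
        return INDETERMINATE
    return SUFFICIENT if coefficient >= threshold else INSUFFICIENT


def pool_ratings(ratings: Sequence[str]) -> str:
    """Combine per-study ratings into one overall rating.

    Assumption (simplified): indeterminate studies are ignored when at least
    one study has a determinate rating; mixed determinate ratings give "±".
    """
    determinate = {r for r in ratings if r in (SUFFICIENT, INSUFFICIENT)}
    if not determinate:
        return INDETERMINATE
    if len(determinate) == 2:
        return INCONSISTENT
    return determinate.pop()


# Hypothetical coefficients from three studies of the same PROM.
study_coefficients = [0.82, 0.74, None]
per_study = [rate_test_retest(c) for c in study_coefficients]
overall = pool_ratings(per_study)
print(per_study, overall)
```

Keeping the per-study rating and the pooling step separate mirrors the two-stage assessment described in the text: each study is rated first, and the overall rating is derived from all studies of that PROM.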

Grading the quality of evidence

Using the COSMIN risk of bias checklist [19, 23], two reviewers (M.Y. and P.Z.) independently assessed the risk of bias (RoB) of each individual development and validation study as very good, adequate, doubtful, or inadequate. For example, we rated the RoB of a PROM development study as very good if the PROM design (including the construct to be measured, origin of the construct, target population, context of use, etc.) and the cognitive interview study or other pilot test regarding the relevance, comprehensibility, and comprehensiveness of the PROM were clearly described; adequate if presumably appropriate but not clearly described; and doubtful or inadequate if not clearly described. A third reviewer (L.Z.) resolved any disagreement.

Using the modified GRADE approach in the COSMIN guideline, the reviewers graded the overall quality of evidence on the measurement properties of a PROM as high, moderate, low, or very low (Additional file 1: table S3 [20,21,22, 24]) [19, 24]. When a specific measurement property was assessed as indeterminate (e.g., due to lack of reporting), the reviewers did not rate the quality of evidence on that particular measurement property [19, 20].

Results

Literature screening and characteristics of included studies

The search strategy yielded 2813 records, with 114 studies [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138] included in this review (Fig. 1). Sixty-three studies (55%) pertained to the development and validation of PROMs for M-TS, while the remaining 51 studies (45%) focused on the validation of PROMs (Additional file 1: table S4 [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138]). The United States accounted for the largest share of studies (46, 40%), followed by Spain (16, 14%) and the UK (15, 13%). The median sample size of studies that developed and validated the PROMs was 205 (range 13 to 1336) and that of studies that only validated the PROMs was 197 (range 10 to 2511).

Fig. 1 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) diagram of study selection

Characteristics of PROMs for M-TS

The 114 studies [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138] reported 69 PROMs for M-TS. Sixty PROMs (87%) were intended for adults (Table 1). Most of the PROMs were disease-specific (32, 46%) or drug-specific (33, 48%), while four were generic (6%). The disease-specific PROMs covered 12 categories of diseases under the International Classification of Diseases, 11th Revision (ICD-11) [139], with 29 specific diseases (most often diabetes, asthma, and migraine). The drug-specific PROMs targeted 12 categories of medicines under the Anatomical Therapeutic Chemical Classification System (ATC) [140] (most often anticoagulants, insulin, and iron chelation).

Table 1 Characteristics of PROMs for medication treatment satisfaction

The majority of M-TS PROMs (61, 88%) were self-reported, four (6%) were either self-reported or proxy-reported (e.g., by parents or clinicians), and the remaining four (6%) were administered by health practitioners (e.g., through interviews). Data collection modes for self-reported PROMs included questionnaires (paper-and-pen or electronic versions), interviews (face-to-face or via phone script), and mobile applications. All 69 included M-TS PROMs were available as questionnaires; only the TSQM-1.4, TSQM-II, and TSQM-9 (3, 4%) had additionally developed phone scripts and mobile applications [141]. In practice, patients can complete self-reported M-TS PROMs independently on paper or electronic devices, or with the assistance of researchers who read the questions aloud and record the patient’s answers (e.g., in in-person or telephone interviews). Thirty-eight PROMs (55%) reported copyright information, of which 28 belonged to the pharmaceutical industry. Sixty-four PROMs (93%) provided free access to the full questionnaires. Ten PROMs (14%) proved easy to administer and complete in the context of clinical trials.

The most commonly measured domains were convenience (40, 58%), side effects (39, 57%), and perceived effectiveness of medication (29, 42%) (Table 1). The most commonly used response option was a 5-point Likert scale (32, 46%). Fewer than half of the PROMs (32, 46%) specified the timing for measuring M-TS. Among those that did, the most common recall period for measuring M-TS was 2 to 4 weeks after the initiation of medication.

Measurement properties of PROMs for M-TS with quality of evidence

Sixty-four PROMs reported the development process (Table 2), all of which described the construct measured and the target population. Ten reported the conceptual framework or theory for defining the construct being measured that supported their generation of measurement items (Additional file 1: table S5 [1, 30, 44, 66, 82, 95, 100, 117, 125, 126, 129, 142]). No common framework or theory was used. Figure 2 summarizes the concepts, components, and influencing factors of treatment satisfaction from the ten frameworks [30, 44, 66, 82, 95, 100, 117, 125, 126, 129] and two theories [1, 142].

Table 2 Development and content validity of PROMs for medication treatment satisfaction
Fig. 2 The concepts, components, and influencing factors of medication treatment satisfaction from current conceptual frameworks

Eight PROMs (12%) had sufficient overall content validity (i.e., relevance: items were relevant for the construct of interest, the target population, and the context of use; comprehensiveness: all key concepts were included; comprehensibility: the PROM was understood by the target population) (5/8, moderate; 3/8, low quality of evidence) (Table 2). The other PROMs failed to simultaneously meet the criteria of relevance, comprehensiveness, and comprehensibility or did not report the content validity. Among 54 PROMs that reported structural validity (all of which applied classical test theory), 13 had sufficient structural validity (i.e., comparative fit index (CFI) or Tucker–Lewis index (TLI) or comparable measure > 0.95, or Root Mean Square Error of Approximation (RMSEA) < 0.06 or standardized root mean residuals (SRMR) < 0.08) (high to moderate quality of evidence). The others did not report or meet the criteria for structural validity (Table 3 and Additional file 1: table S6 [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138]) [19, 20]. Among 56 PROMs that reported construct validity, 39 had sufficient convergent validity (e.g., correlations with instruments measuring similar constructs ≥ 0.50) (39/44, 89%; 38/39, high to moderate; 1/39, low quality of evidence) and 43 had sufficient discriminative or known-groups validity (e.g., correlations with instruments measuring unrelated constructs < 0.30) (42/45, 93%; 41/43, high to moderate; 2/42, low quality of evidence).
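The structural validity criterion quoted above is a disjunction of cut-offs on model fit indices, which can be made explicit in a short Python sketch. The function name `rate_structural_validity` and the example values are hypothetical; the sketch also assumes that a PROM with no reported fit index is rated indeterminate and that a PROM with reported indices that all miss their cut-offs is rated insufficient, which is a simplification of the full criteria.

```python
from typing import Optional


def rate_structural_validity(
    cfi: Optional[float] = None,
    tli: Optional[float] = None,
    rmsea: Optional[float] = None,
    srmr: Optional[float] = None,
) -> str:
    """Return "+", "-", or "?" from the fit indices of a confirmatory factor analysis.

    Cut-offs are those stated in the text: CFI or TLI > 0.95,
    or RMSEA < 0.06, or SRMR < 0.08.
    """
    checks = []
    if cfi is not None:
        checks.append(cfi > 0.95)
    if tli is not None:
        checks.append(tli > 0.95)
    if rmsea is not None:
        checks.append(rmsea < 0.06)
    if srmr is not None:
        checks.append(srmr < 0.08)

    if not checks:  # assumption: nothing reported means indeterminate
        return "?"
    return "+" if any(checks) else "-"


# Hypothetical fit indices for two PROMs.
print(rate_structural_validity(cfi=0.97, rmsea=0.07))   # CFI meets its cut-off
print(rate_structural_validity(rmsea=0.08, srmr=0.10))  # no index meets its cut-off
```

Because the criterion joins the cut-offs with "or", a single index that meets its cut-off is enough for a sufficient rating in this sketch, even when other reported indices do not.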

Table 3 Measurement properties of PROMs for medication treatment satisfaction

Among 63 PROMs that reported internal consistency, 11 demonstrated sufficient internal consistency (i.e., Cronbach’s alpha(s) ≥ 0.70 and at least low evidence for sufficient structural validity) (11/63, 17%, high quality of evidence). Thirty-eight PROMs reported test–retest reliability, of which 24 had sufficient test–retest reliability (i.e., ICC or weighted Kappa ≥ 0.70) (24/38, 63%; 18/24, high to moderate; 6/24, low to very low quality of evidence). The others did not report or meet the criteria for test–retest reliability.
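For readers less familiar with the internal consistency criterion, the following Python sketch computes Cronbach’s alpha with the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), and then applies the two-part rule quoted above. The response data, the function names, and the handling of the non-sufficient cases (indeterminate when structural validity is not supported or alpha is missing) are illustrative assumptions and were not taken from any included study.

```python
from statistics import variance
from typing import Optional, Sequence


def cronbach_alpha(responses: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # transpose to item columns
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


def rate_internal_consistency(
    alpha: Optional[float], structural_validity_supported: bool
) -> str:
    """Apply the rule: alpha >= 0.70 AND at least low evidence for structural validity."""
    if alpha is None or not structural_validity_supported:
        return "?"                             # assumption: cannot be judged
    return "+" if alpha >= 0.70 else "-"


# Hypothetical answers of five respondents to a three-item subscale (1 to 5 scale).
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2), rate_internal_consistency(alpha, structural_validity_supported=True))
```

The second argument makes the dependency visible: under this rule a high alpha alone does not yield a sufficient rating, which is consistent with far fewer PROMs meeting the criterion than reporting internal consistency.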

Sixteen PROMs reported responsiveness, of which six drug- or disease-specific PROMs demonstrated sufficient responsiveness (e.g., area under the receiver operating characteristic (ROC) curve (AUC) ≥ 0.7). No generic PROM had sufficient responsiveness.

Four PROMs for M-TS [60, 78, 80, 87] proposed the minimal important difference (MID) (Additional file 1: table S7 [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138]). Four PROMs demonstrated normally distributed scores in the study population while 11 PROMs showed negatively or positively skewed scores. Thirty-two PROMs for M-TS reported floor or ceiling effects.

PROMs for M-TS with sufficient validity and reliability

One drug-specific PROM (Insulin Treatment Satisfaction Questionnaire, ITSQ) [42, 43] demonstrated sufficient construct validity, internal consistency, test–retest reliability, and responsiveness (high to moderate quality of evidence), and content validity (low quality of evidence). Two generic PROMs (TSQM-1.4 [25,26,27,28,29] and SATMED-Q [31, 37,38,39,40]) demonstrated sufficient construct validity, structural validity, and internal consistency (high quality of evidence) and content validity (moderate to low quality of evidence), but lacked evidence on test–retest reliability or responsiveness.

Discussion

Summary of findings

This review systematically searched and evaluated current PROMs for M-TS. Over 85% of the PROMs targeted adult patients. Most PROMs demonstrated sufficient construct validity including convergent validity (39/69, 57%) and discriminative or known-groups validity (40/69, 58%) (high to moderate quality of evidence) but failed to demonstrate sufficient content validity (61/69, 88%), structural validity (56/69, 81%), internal consistency (58/69, 84%), or test–retest reliability (45/69, 65%). Few PROMs reported responsiveness (16/69, 23%). Only four PROMs provided an approach for interpreting the results of the PROMs (i.e., the MIDs).

Introduction of the three PROMs for M-TS with sufficient validity and reliability

The ITSQ [42, 43] is a drug-specific PROM for insulin treatment satisfaction including domains of inconvenience of regimen (5 items), lifestyle flexibility (3 items), glycemic control (3 items), hypoglycemic control (5 items), and insulin delivery device satisfaction (6 items). According to the Allie database (i.e., a database for searching studies in PubMed and MEDLINE that involve a particular abbreviation and long form) [143], 12 trials in adults have applied the ITSQ. Because of an unclear recall period and a lack of evaluation of content validity, the quality of evidence on the content validity of the ITSQ remained low.

Both TSQM-1.4 [25,26,27,28,29] (with 14 items) and SATMED-Q [31, 37,38,39,40] (with 17 items) are generic PROMs for adults with chronic diseases. The two PROMs share four common domains: perceived effectiveness, side effects, convenience, and global satisfaction. The SATMED-Q has two additional domains (i.e., impact on daily living or activity, and process of medical care or medical follow-up). The TSQM-1.4 was the most widely used PROM for M-TS (applied in 106 trials, 4 of which were in children and adolescents), while the SATMED-Q was rarely used in trials (applied in ten trials in adults during the last 15 years) [143]. Both PROMs still lack high quality of evidence on content validity and responsiveness.

Limitations of current PROMs for M-TS

Although a conceptual framework is not mandatory for developing measures, providing a theoretical foundation facilitates defining key components and their relationships within the construct [6, 7]. Our study revealed that over 70% of PROMs did not define a conceptual framework for treatment satisfaction. The existing frameworks draw heavily from Shikiar’s pyramid theory [1] and Weaver’s concept of treatment satisfaction [142]. These frameworks, however, do not adequately account for differences in treatment satisfaction between patients with chronic disease and those with acute disease (e.g., chronic disease patients tend to emphasize long-term outcomes and quality of life, while acute disease patients prioritize immediate relief and symptom management) [144] or between adults and children (e.g., children might prioritize medication side effects and ease of administration, whereas adults focus more on the effectiveness) [145]. These differing priorities highlight the need for PROMs that are tailored to specific disease contexts to accurately capture patient satisfaction. More than half of the PROMs (41/69, 59%) did not clearly report the process of cognitive interviews or other pilot tests, which undermines the transparency and rigor of the development process [146,147,148].

Regarding content validity, because the context of use was vague (e.g., it was unclear whether the PROM was intended for clinical trials or clinical practice, or for discriminative, evaluative, or predictive purposes), we rated the relevance of most of the PROMs as indeterminate; because of the paucity of justification for the appropriateness of response options and recall period, we rated the comprehensiveness as insufficient; and because of the lack of justification for the understandability of the PROMs in the target population, we rated the comprehensibility as insufficient (Additional file 1: table S6 [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138]).

Our systematic search found that six PROMs for M-TS were developed specifically for children and adolescents. These PROMs, however, lacked evidence of sufficient validity, reliability, or responsiveness [51, 101, 102, 134,135,136]. Compared with adults, children and adolescents have limited vocabulary, comprehension, and self-awareness, which probably influences their ability to respond accurately to adult-oriented measures [149]. The validity, reliability, responsiveness, and feasibility of the PROMs developed for adults should be evaluated before they are applied in children and adolescents.

According to Allie [143], most PROMs were infrequently used, with 38% never applied in any trial and 42% applied in fewer than five trials. The Patient and Partner Treatment Satisfaction Scale in Erectile Dysfunction (TSS), the second most often used PROM (the first was the TSQM), had inconsistent content validity (moderate quality of evidence) (Table 2). The infrequent application of the PROMs for M-TS and the wide use of PROMs with poor measurement properties indicate that potential stakeholders (e.g., clinicians or researchers) probably lack awareness of and access to validated PROMs for M-TS.

The practical application of self-reporting in M-TS PROMs faces several challenges, particularly for specific populations. Individuals with mental health issues may struggle to report treatment satisfaction accurately due to cognitive impairments or emotional distress [150]. Those with limited digital skills may find electronic PROMs difficult, resulting in incomplete or biased data [151]. Similarly, individuals with reading difficulties or low literacy may misinterpret questions, compromising response reliability [152]. These challenges underscore the need for inclusive PROM design, ensuring accessibility and comprehension for diverse patients. Alternative data collection methods, such as interviews or caregiver reports, should be considered for those unable to self-report reliably.

Strengths and limitations of this systematic review

We conducted a comprehensive search for current PROMs for M-TS. This review included four Chinese databases, which expanded the scope beyond English-language publications to provide a more comprehensive understanding of M-TS across different cultural contexts and enhance the diversity, comprehensiveness, robustness, and applicability of our systematic review [153]. Following the COSMIN guideline [19, 20], we assessed the measurement properties of the PROMs and the risk of bias of individual studies, and rated the quality of the body of evidence on the development and validation of the PROMs.

This review has some limitations. First, we only included studies published in English and Chinese and might have missed PROMs for M-TS reported in other languages. Second, poor reporting of individual studies impeded our ability to assess the measurement properties and to rate the quality of evidence. To minimize the impact of ambiguous reporting, we searched for and abstracted data from all available literature for each included PROM. Third, some evaluation criteria for measurement properties were subjective (e.g., the content validity criterion “are all key concepts included?”). We attempted to reduce differences between reviewers by conducting calibration exercises, duplicate assessments, and group discussions when discrepancies occurred. Fourth, we did not recommend PROMs based on the COSMIN guideline criteria (i.e., PROMs with evidence for sufficient content validity (any level) and sufficient internal consistency (at least low level) have the potential to be recommended as the most suitable PROM [19, 20]) because we consider the evaluation of content validity to be subjective. Instead, based on the evaluation of measurement properties, we highlighted three PROMs with sufficient validity (particularly construct validity) and reliability (particularly internal consistency).

Recommendations for future research and clinical practice

Confident use of currently available PROMs for M-TS will require validation studies to assess their measurement properties (especially content validity, test–retest reliability, and responsiveness), interpretability, and feasibility in different contexts and populations. Further studies should pay more attention to assessing the measurement properties of current PROMs in children and adolescents or to developing PROMs targeted at this population. Although COSMIN provides a reporting guideline for validation studies [154], our review found that the guideline was not widely used. A uniform standard for reporting the development and validation of PROMs is needed to improve the reporting quality of studies on PROMs for M-TS. Additionally, collaborative efforts among researchers, clinicians, and policymakers are essential to develop and implement robust M-TS PROMs applicable to diverse patient groups. This will bridge the gap between current PROMs and patient needs, enhancing patient care and treatment outcomes.

The implementation of validated and reliable M-TS PROMs in clinical practice can significantly enhance patient care by providing healthcare providers with better tools to assess and address patient satisfaction and treatment outcomes. By accurately capturing patient experiences and preferences, PROMs can help tailor treatments to individual needs, leading to improved adherence and better health outcomes. Moreover, the use of PROMs can facilitate shared decision-making, empowering patients to be active participants in their own care. This patient-centered approach not only improves satisfaction but also contributes to more effective and efficient healthcare delivery. Ensuring that PROMs are culturally sensitive and accessible to all patient groups, including those with mental health issues, limited digital skills, and reading difficulties, is crucial for their widespread adoption and utility in diverse clinical settings.

Conclusions

Most current PROMs for M-TS demonstrated sufficient construct validity, while only a few had sufficient content validity, structural validity, internal consistency, and test–retest reliability. Few PROMs reported responsiveness. Confident use of current PROMs requires further evaluation of their validity, reliability, responsiveness, and interpretability. Reporting guidelines are needed to enhance the reporting quality of the development and validation of PROMs for M-TS.