Plain English summary

A 17-item measure of active living for older persons (OPAL) was developed by interviewing 148 older persons from four countries and in four languages: English, French, Spanish, and Dutch. Older persons identified active living as a way of being and the content of the measure is based on their input. To show that the measure is useable, it is necessary to test it in another sample of people and show that the items are important, hang together as a measure, and behave as expected with other measures. A large sample of 1612 people from four countries was surveyed. All the items were rated as important and there is room for improvement in how much people are living actively. The results showed that the score on the OPAL measure related to other measures of health and activity but was also different from them. The maximum score on the OPAL measure is 51 and the average value in the sample tested was 33. People with breathing problems, fatigue, or feeling they are in poor health scored around 10 points lower out of 51 than people without these symptoms. A difference of 6 points out of 51 was found to be an important difference. These results indicate that the OPAL measure can be used to measure people at one point in time.

Background

In 2020, the Canadian Institutes of Health Research (CIHR) funded an international team to develop a measure to inform and evaluate active living programs for older persons. The World Health Organization has spearheaded a movement towards active living or active aging [1] which is defined as …the process of optimizing opportunities for health, participation, and security in order to enhance quality of life as people age. Active aging/living allows people to realize their full potential and to participate in society according to their needs, desires and capacities. How active living is operationalized into local programs or evaluated is often absent from the discussion.

An example of the challenges in evaluating community programs came from the experience of administering traditional measures of constructs related to health aspects of quality of life to a sample of older persons participating in a community outdoor walking program [2]. The participants were assisted in filling out the questionnaires, and it was apparent to the people assisting that the items did not seem to relate to how this sample was currently living their lives. This resulted in many questions about how to respond to the items and even whether there were right and wrong answers. There was confusion about terms that seemed similar (anxiety and stress), how to respond to questions about pain when pain is a background feature of their lives, and how to respond to questions about activities that they don’t do. This was real-world evidence that our existing measures for older persons were not fit for the purpose of assessing how older persons were living their lives outside of the clinical context, nor for evaluating the impact of interventions or programs targeting active living. Developing a new measurement framework was considered worthwhile given the growing older adult population and the number of programs being developed that require evaluation.

This experience was not unique, and a group of researchers working in the field of seniors’ health and age studies became engaged in developing a new measure specifically designed to reflect how older persons were living their lives in their communities. The content development for this new measure has been described previously. Briefly, using best practices for content development, a series of focus groups and individual interviews was conducted across four countries (Canada, United States, United Kingdom, Netherlands) and in four languages (English, French, Spanish, and Dutch).

The content development was framed from an anti-ableist perspective. The strong bias towards “fixing” aging and its inherent problems shares common ground with ableism [3], the bias towards considering people with disabilities as inferior because they deviate from an ideal, able-bodied existence and are in need of fixing [4]. Titchkosky postulates that disability can be an identity which “makes it possible to insert into the world alternative ways of being and of knowing” [5]. Our thesis was that this framing could be applied to the situation of aging: an identity making it possible to have alternate ways of being and knowing. As emphasized by thought leaders on aging in the 1990s, “Aging is not a problem to be solved rather a mystery to be lived” [6].

The underlying conceptual model for the active living construct was formative [7,8,9], that is, the items are considered to form the construct. Based on this perspective, the content development phase identified 17 “ways of being” that older persons, across these 6 country-language strata, identified as contributing to active living as an older person. These 17 “ways of being” were each evaluated on a four-point ordinal scale based on frequency (always or almost always, often, sometimes, rarely or never) not extent, to avoid an ableist framework.

Persons were asked to indicate how often in the past six months they felt: active, confident, connected, useful, creative, encouraged, energetic, involved, happy, healthy mentally, healthy physically, independent, interested, mentally sharp, and motivated. The results of the content development phase provided evidence that existing measures did not meet users’ needs. Many items were negatively worded and framed in an ableist view where persons are rated against an ideal. This first paper also demonstrated that the experiences of the target respondents were represented in the content of the measure.

Objectives

The purpose of this paper is to present further evidence of the extent to which this new measure, the Older Persons Active Living (OPAL) measure, is “fit-for-purpose” [10]. The objectives of this study are framed using the American Educational Research Association’s Standards for Educational and Psychological Testing [11] and the argument-based approach to validity [12, 13]. The specific objectives are to provide evidence that:

  (i) the items chosen are considered important to target responders and vary in frequency;

  (ii) the response categories are ordered in the expected linear fashion;

  (iii) the data derived from administering the items of the measure to the targeted sample fit the hypothesized structure of the measure;

  (iv) the scores from the measure under study correlate to the scores on a criterion measure or correlate with scores on measures that represent converging or diverging constructs; and

  (v) the scores on the index measure differ across categories of variables that are known to differ for this construct (e.g. age, sex, disease severity etc.).

Methods

This was a cross-sectional study of a population drawn from a participant panel, HostedinCanada (HIC). Briefly, HIC notifies individuals registered on its platform of the survey and informs them of the eligibility criteria. Individuals who agree to receive more information about the study are provided electronically, by HIC, with the study’s consent form. Those who agree sign the consent form electronically and are directed to the survey, which is hosted on the HIC platform. English, French, Spanish and Dutch versions of the consent form and survey instrument were provided. A description of the privacy protections for individuals and for the data is available on the HIC website (https://www.hostedincanadasurveys.ca/).

Population

The target population was people over the age of 65 years living independently in the community, without an illness requiring ongoing hospital treatment. The population was balanced across country and language strata. The sample size was estimated for confirmatory factor analysis using the rule of thumb of > 10 persons per item [14]. With 17 items to be tested, the minimum sample size estimate was 170. As we wanted at least this minimum sample size per country, the total sample size was set in proportion to the size of the populations in the four countries, with a minimum sample size of 100 for each language. The smallest sample size was for Canadian French (n = 100); Canada, the UK, and English-speaking USA were each to have a total sample size of 400 (300 for Canadian English); US Spanish and the Netherlands were to have 200 each, for a total sample size of 1600 people.
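The sample size reasoning above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the stratum labels and variable names are assumptions, while the counts are those stated in the text.

```python
# Planned sample sizes per country-language stratum, as stated in the text.
# Stratum labels are illustrative; the first word is used as the country.
ITEMS = 17
PERSONS_PER_ITEM = 10  # rule of thumb cited in the text

planned = {
    "Canada French": 100,
    "Canada English": 300,
    "UK English": 400,
    "USA English": 400,
    "USA Spanish": 200,
    "Netherlands Dutch": 200,
}

minimum_for_cfa = ITEMS * PERSONS_PER_ITEM
total_planned = sum(planned.values())

# Aggregate strata by country to check the per-country minimum.
per_country = {}
for stratum, n in planned.items():
    country = stratum.split()[0]
    per_country[country] = per_country.get(country, 0) + n

print(minimum_for_cfa)  # 170
print(total_planned)    # 1600
print(all(n >= minimum_for_cfa for n in per_country.values()))  # True
```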

Measurement

The measurement strategy for this study was informed by aging models which are strongly influenced by the concept of frailty [15,16,17], more positively framed as intrinsic capacity [18], as well as models of health-related quality of life (HRQL) [19, 20].

The survey instrument comprised the items from the OPAL measure rated for importance on a 5-point ordinal scale (extremely, quite a bit, moderately, a little bit, not at all) and rated on frequency on a 4-point ordinal scale (always or almost always, often, sometimes, rarely or never). These response options were chosen as they have been shown to have linear properties [21].

In addition to the items from the OPAL measure, the survey included content related to demographics, comorbidity, impairments and symptoms that are common in the older population, and indicators of well-being, physical and cognitive capacity, and health. Apart from single items, the survey included the Lower Extremity Functional Scale (LEFS) [22] and Communicating Cognitive Concerns (C3Q) [23], measures calibrated using Rasch analysis allowing for the use of shorter versions. Also used were the Well Being Index [24] and the EQ-5D-5L [25].

Analyses

Distributional parameters were used to characterize the sample. Distributions for the importance ratings and the frequencies were presented separately for men and women. As the sample size was large, only differences between men and women of 10% or more were considered of importance [26]. The effects of strata, age, and gender on importance ratings and frequencies were assessed using logistic regression. For strata, the referent category was US English; for age, it was the youngest category, 65–70 years, and for gender men were compared with women.

The ordinality of the item responses for frequency was tested by fitting the data to the Rasch Model, yielding indications of disordered thresholds, and by viewing the category probability curves. Differential item functioning (DIF) was tested for strata (country/language segment), age, and gender using analysis of variance and visual inspection of the magnitude of the effects. The structure of the measure was assessed using item-to-item correlations for both importance ratings and frequencies. Structural equation modeling (SEM) in Mplus was used to identify the structure of the measure. Exploratory factor analysis (EFA) was used and fit statistics were estimated for one- to seven-factor models using the interpretations from Schermelleh-Engel et al. [27]. To identify how to create a scoring method for the 17 items, a principal component analysis (PCA) was conducted.

The distribution of OPAL scores, calculated as an unweighted sum of the ordinal ratings, was assessed for floor and ceiling effects. The relationships between scores on the OPAL measure and the converging constructs of physical, emotional, and psychological health and HRQL were assessed using Spearman correlation. As the OPAL measure taps a completely new construct, correlations with converging constructs were expected to be moderate (~ 0.5). The effects of selected sociodemographic and impairment-related variables on OPAL scores were described using calculated differences and effect sizes, and adjusted estimates of difference were obtained through linear regression adjusted for age, sex, and strata.
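As an illustration of the scoring described above, the following minimal sketch computes an unweighted sum over 17 items coded 0 to 3 and the proportion of scores at the floor and ceiling. The function names and the example respondents are hypothetical and are not taken from the study data.

```python
# Response coding on the 0-3 frequency scale described in the text.
FREQUENCY_CODES = {
    "rarely or never": 0,
    "sometimes": 1,
    "often": 2,
    "always or almost always": 3,
}
N_ITEMS = 17
MAX_SCORE = N_ITEMS * max(FREQUENCY_CODES.values())  # 17 * 3 = 51


def opal_score(responses):
    """Unweighted sum of the 17 coded frequency responses."""
    if len(responses) != N_ITEMS:
        raise ValueError("expected 17 item responses")
    return sum(FREQUENCY_CODES[r] for r in responses)


def floor_ceiling(scores):
    """Proportion of scores at the minimum (0) and at the maximum (51)."""
    n = len(scores)
    at_floor = sum(s == 0 for s in scores) / n
    at_ceiling = sum(s == MAX_SCORE for s in scores) / n
    return at_floor, at_ceiling


# Hypothetical respondents, for illustration only.
respondents = [
    ["often"] * 17,
    ["always or almost always"] * 17,
    ["sometimes"] * 10 + ["rarely or never"] * 7,
    ["rarely or never"] * 17,
]
scores = [opal_score(r) for r in respondents]
print(scores)                 # [34, 51, 10, 0]
print(floor_ceiling(scores))  # (0.25, 0.25)
```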

Results

Characteristics of the sample

Table 1 presents a description of the sample according to socio-demographic characteristics. Table 2 presents a description of the sample on functional, health, and social support indicators. The sample was predominantly between 65 and 75 years of age and of white European ancestry (87%). The education of the sample was evenly distributed across high school or less, certificate programs, or bachelor’s degree, with 11% holding post-graduate degrees. More than half (54%) declared they had none of the listed health conditions; the most common health conditions listed, ranging from 15 to 9%, were cardiovascular, musculoskeletal, and respiratory diseases; 10% of people reported that they did not always have money to meet their needs.

Table 1 Socio-demographic characteristics of the sample
Table 2 Characteristics of the sample on functional indicators

Table 2 presents characteristics of the sample on functional indicators. A small proportion of people had low vision (0%) or low hearing (8%), and 27% reported pain interfering with daily activities. Sixty percent (60%) of the sample reported having indicators of positive well-being; smaller proportions reported that they woke up feeling fresh and rested (45%) or felt active and vigorous (37%). For physical function, 30% reported they would have difficulty walking for 30 min.

Evidence for importance of content

Table 3 shows importance and frequency ratings for men and women separately. To simplify the presentation and the analysis, the 5-level response option for importance was dichotomized to extremely or quite important vs. moderately, a little bit, or not at all important. Similarly, the 4-level response option for frequency was dichotomized at always or almost always vs. sometimes or rarely/never.

Table 3 Item specific importance and frequency ratings

Across the “ways of being” items, importance ratings ranged between 60 and 90%. There were no differences between men and women of magnitude 10% or more. The top 3 important “ways of being” for men and women were mentally sharp, independent, and healthy mentally. The least important items were creative and active. Average importance values across the 5-level ordinal rating scale ranged from a low of 2.66 (creative) to a high of 3.45 (mentally sharp) yielding an importance ratio of highest to lowest of 1.29.

As there were 17 items and 6 strata, with US English as the referent, there were a total of 17*5 item-strata comparisons of importance (n = 85). Of these, there were only 4 (4.7%) where the importance rating was higher for one stratum compared to US English: encouraged, USA Spanish (OR 1.6; 95% CI 1.1–2.4); involved, Netherlands (OR 1.5; 95% CI 1.0–2.2); motivated, Canadian French (OR 1.8; 95% CI 1.0–3.3) and UK (OR 1.4; 95% CI 1.0–2.0). This suggests that the harmonization process yielded items that were equally relevant across these strata, with a few exceptions that might have been expected owing to the large number of comparisons made. There were six gender differences by statistical criteria, but none of the crude gender differences reached the 10% magnitude considered important. There were four age group differences, with older age consistently endorsing higher importance. Full details are presented in Supplementary Table 1.

For the 85 item-strata frequency comparisons, there were 25 (29.4%) where there were differences across strata, with 21 differences showing less endorsement of higher frequency of active living items in countries other than the USA. Full details are presented in Supplementary Table 1.
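The counts and percentages of item-strata comparisons reported above follow from simple arithmetic, sketched here with illustrative variable names.

```python
N_ITEMS = 17
N_STRATA = 6  # country-language strata; US English is the referent

# Each item is compared in every non-referent stratum against US English.
comparisons = N_ITEMS * (N_STRATA - 1)

importance_flagged = 4   # importance comparisons differing from US English
frequency_flagged = 25   # frequency comparisons differing across strata

print(comparisons)                                        # 85
print(round(100 * importance_flagged / comparisons, 1))   # 4.7
print(round(100 * frequency_flagged / comparisons, 1))    # 29.4
```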

Evidence for response categories

The response categories for the frequency ratings were chosen from the list recommended by Mutebi et al. [21] as having linear properties. The ordinality and linearity were tested by fitting the response data to the Rasch Model. There were no disordered thresholds, and visualization of the category probability curves and the score structure map supported linearity. Figure 1 shows how the different response options are distributed across a standardized OPAL score presented along the x-axis. The ordinal scoring structure has been converted to a 1 to 4 range rather than the original 0 to 3 structure (1: rarely or never; 2: sometimes; 3: often; 4: always or almost always). With 4 response options there are three thresholds, which are the boundaries between 1 and 2, 2 and 3, and 3 and 4. The width of the bars is quite similar, showing that the distance between the categories is quasi-equal.

Fig. 1 Distribution of item categories over the OPAL items

Evidence for structure

Item-to-item correlations (n = 136) for importance ratings and frequencies are presented in Table 4. Thirteen pairs of importance items were correlated at 0.8 to < 0.87 and an additional 46 were correlated at 0.7 to < 0.8; the numbers of similarly strongly correlated frequency item-pairs were 12 and 47. Apart from “creative”, all other items had some degree of strong correlation with other items. The item-to-item correlations for frequency inform the extent to which the items measure the same construct. The average correlation across items was 0.67 (SD: 0.08) with a range from 0.39 to 0.86.

Originally, the OPAL measure was conceptualized as measuring a construct similar to HRQL with domains related to physical, emotional, social, and psychological health. A confirmatory factor analysis (CFA) failed to provide a solution and hence an EFA was used. The scree plot, shown in Fig. 3, indicated that a two-factor model was supported by the data, with the first eigenvalue for the sample correlation matrix of 11.7, the second 1.1, and the third 0.9. The 7-factor model showed good fit according to all the criteria and is presented in Table 5. This model showed only 1 item with a cross-factor loading (> 0.3). There were four 3-item factors, two 2-item factors, and one 1-item factor. Interestingly, the item groupings made theoretical sense but the number of items per factor is below the recommended number for reliable domain representation [28, 29]. As the inter-item correlations presented in Table 4 were all high, as were the item-to-total correlations (mean 0.77), the evidence is not overwhelming for a measure with multiple domains. The PCA showed that the first PC explained 68.9% of the variance, indicative of one strong component. The PCA weights were similar across the 17 items: mean 0.82 (SD: 0.05); range 0.67 to 0.90.
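Several of the structural figures above can be cross-checked arithmetically. The sketch below is a hedged illustration, not the authors' analysis: it assumes the eigenvalues come from a 17 × 17 correlation matrix (so the total variance is 17), applies the common eigenvalue-greater-than-one rule as one possible reading of the scree plot, and uses only the three rounded eigenvalues reported in the text.

```python
from math import comb

N_ITEMS = 17
eigenvalues = [11.7, 1.1, 0.9]  # first three eigenvalues, as reported (rounded)

# Number of distinct item pairs in a 17-item correlation matrix.
item_pairs = comb(N_ITEMS, 2)

# Eigenvalue-greater-than-one rule (an assumption about how the scree plot was read).
n_above_one = sum(ev > 1 for ev in eigenvalues)

# Share of total variance for the first component; a correlation matrix of
# 17 items has total variance 17. The rounded eigenvalue gives roughly the
# 68.9% reported for the first principal component.
first_share = 100 * eigenvalues[0] / N_ITEMS

# Items per factor in the 7-factor model: four 3-item, two 2-item, one 1-item.
factor_sizes = [3, 3, 3, 3, 2, 2, 1]

print(item_pairs)             # 136
print(n_above_one)            # 2
print(round(first_share, 1))  # 68.8
print(sum(factor_sizes))      # 17
```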

Table 4 Item-to-item correlations among importance and prevalence ratings
Table 5 Factor structure of the OPAL measure

Evidence for OPAL scores and other constructs

Based on the linearity of the response options for frequency [21] (see Fig. 1) and the relatively similar importance ratings for each item (see Table 3), a simple linear sum was considered mathematically justified. Figure 2 shows that this approach yielded a distribution that was near normal.

Fig. 2 Distribution of OPAL scores

Table 6 shows that correlations with the constructs related to physical function (LEFS), self-reported cognitive ability (C3Q), and HRQL (EQ-5D value) were all in the expected moderate range (~ 0.5).
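For readers unfamiliar with the Spearman correlation used here, the following minimal sketch computes it as the Pearson correlation of tie-averaged ranks. The helper names and the example scores are hypothetical; they are not the study data, and a statistical package would normally be used instead.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical OPAL totals and physical function scores, for illustration only.
opal_totals = [12, 25, 33, 33, 40, 51]
function_scores = [20, 35, 30, 50, 45, 60]
rho = spearman(opal_totals, function_scores)
print(round(rho, 2))
```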

Table 7 shows how the OPAL measure differs across constructs known to affect participation and HRQL. The OPAL score across all strata was 33.1 (SD: 11.5) with a maximum score of 51 (17 items scored on a 4-point scale, 0 to 3). There were only small differences in frequency across strata. There were no differences between men and women. The oldest groups (76 + years) scored slightly higher than the youngest group (65–70 years).

The adjusted differences between categories of variables hypothesized to influence active living ranged from −3.10 (low vision) to −11.82 (being out of breath with ordinary activities). Given the overall SD of 11.5, the largest of these differences corresponds to an effect size of about 1. A medium effect size (1/2 SD) was observed for low hearing (−6.36) and having a health condition (−6.66), suggesting a Minimum Important Difference (MID) of 6. Large effect sizes (≥ 0.8) were observed for pain, fatigue, and shortness of breath.
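The effect sizes implied by these adjusted differences can be sketched as the absolute difference divided by the overall SD. The thresholds used for the labels below (0.2, 0.5, 0.8) are the conventional cut-points and are an assumption here; the text itself only names the 1/2 SD and ≥ 0.8 levels.

```python
SD = 11.5  # overall SD of the OPAL score, as reported

# Adjusted differences reported in the text.
adjusted_differences = {
    "low vision": -3.10,
    "low hearing": -6.36,
    "health condition": -6.66,
    "out of breath": -11.82,
}


def effect_size(difference, sd=SD):
    """Standardized difference: absolute difference divided by the SD."""
    return abs(difference) / sd


def label(es):
    """Conventional labels (assumed cut-points: 0.2, 0.5, 0.8)."""
    if es >= 0.8:
        return "large"
    if es >= 0.5:
        return "medium"
    if es >= 0.2:
        return "small"
    return "negligible"


for name, diff in adjusted_differences.items():
    es = effect_size(diff)
    print(f"{name}: {es:.2f} ({label(es)})")
# low vision: 0.27 (small)
# low hearing: 0.55 (medium)
# health condition: 0.58 (medium)
# out of breath: 1.03 (large)

half_sd = SD / 2
print(half_sd)  # 5.75, close to the suggested MID of 6
```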

Discussion

The results of this study provide evidence that the 17-item OPAL measure is fit for the purpose of estimating the extent to which older persons are living actively at one point in time. Content representation was previously demonstrated and evidence presented in Table 3 shows homogeneity of importance ratings across population strata suggesting that the language versions were similar. The linearity of the response options was confirmed using item-characteristic curves, visualization of the threshold map (Fig. 1) and there was lack of differential item functioning across strata, age, and sex.

There was evidence that the items related to the same underlying construct, as the item-to-item correlations (see Table 4) were mostly strong. The strong item-to-item correlations also suggest item redundancy; shorter versions may be possible and will be the topic of future research.

Although statistically a seven-factor model fit the data (see Table 5), each factor comprised very few items and item-to-item correlations were very high, suggesting that one factor is plausible. In addition, although the OPAL measure is conceptualized as a formative model, all items fit the Rasch model, suggesting unidimensionality.

These results provide evidence that a simple linear sum of the responses to the 17 items is a mathematically valid way of deriving a total score. While there were some differences in importance ratings, the ratio of the highest to lowest importance was 1.29.

The OPAL measure showed low ceiling and floor effects, with 6.2% of values at the maximum and only 5 values at 0. The distribution was near normal with a mean of 33.1 (out of 51) and an SD of 11.5, as shown in Fig. 2. Correlations with other measures of converging constructs were of moderate strength (~ 0.50; see Table 6), and differences across groups known to differ in functioning and health were observed, suggesting an MID of 6 out of 51, or near 12 when scores are converted to be out of 100 (see Table 7).
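The conversion of the MID from the 0 to 51 metric to a 0 to 100 metric is a simple linear rescaling. The sketch below uses a hypothetical helper name, and the floor percentage assumes the 1612 surveyed respondents as the denominator.

```python
MAX_SCORE = 51
N_RESPONDENTS = 1612  # sample size reported in the summary (assumed denominator)


def to_percent_scale(score, max_score=MAX_SCORE):
    """Linearly rescale a 0-51 OPAL score to a 0-100 metric."""
    return 100 * score / max_score


mid_out_of_100 = to_percent_scale(6)
mean_out_of_100 = to_percent_scale(33.1)
floor_percent = 100 * 5 / N_RESPONDENTS  # 5 respondents scored 0

print(round(mid_out_of_100, 1))   # 11.8, i.e. near 12
print(round(mean_out_of_100, 1))  # 64.9
print(round(floor_percent, 2))    # 0.31
```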

Table 6 Relationships between scores on OPAL and converging constructs
Table 7 Differences in OPAL scores across functional status indicators

The findings from the validation analyses are typical of measures in the field. For example, respondents during the testing of the Older People’s Quality of Life questionnaire (OPQOL-brief) [30], comprising 13 items rated on a 1–5 agreement scale (higher is better), showed an average response value for each item ranging from 3.94 to 4.39. For the OPAL measure, with responses ranging from 0 to 3, the proportion responding 2 or 3 ranged from 15.5% (encouraged) to 55.6% (independent) as shown in Table 3, providing some evidence that the OPAL measure has a lower ceiling effect than the OPQOL-brief. Item-to-total correlations for the OPQOL-brief ranged from 0.36 (I get pleasure from my home) to a high of 0.67 (I enjoy my life overall). For the OPAL measure, the item-to-total correlations were higher, ranging from 0.71 to 0.85, supporting the evidence from the factor analysis (Table 5) that a one-factor model is plausible. The total score on the OPQOL-brief was moderately correlated with variables hypothesized to influence QoL such as self-rated active aging (ρ 0.503) and self-rated health status (ρ 0.517). Our values for similar constructs were 0.50 for active living and 0.56 for EQ-5D. Similarly, for physical function, the correlation with the OPQOL-brief was 0.43 and the correlation with OPAL was 0.51.

Fig. 3 Scree plot from exploratory factor analysis

The PROMIS-29 is another measure proposed for use among older adults with multiple chronic conditions [31]. The evidence they present is that there were strong differences in the physical and mental health scores (PHS; MHS) across groups expected to differ. Some of these differences make relevant comparisons to validity evidence presented for OPAL. Scores on PROMIS-29 decreased according to number of chronic conditions with similar magnitude to what was found for OPAL for people reporting a health condition (see Table 7). The other comparisons are less relevant as OPAL was developed for use by those living independently in their community. There was evidence presented that PROMIS-29 PHS correlated with physically associated scales (similar to item-to-total correlations) with ranges 0.63 to 0.81 similar to the item-to-item correlations for OPAL (see Table 4).

Not every type of validity evidence could be assessed in this study. Further work is needed to derive evidence for test–retest reliability and how scores change in response to intervention. Generalizing the results of this study outside of these four settings and languages also requires the relevant evidence. The development of a measure is an ongoing process and is based on the accumulation of evidence that the data arising from the measure are fit for intended purposes. “Validity” does not belong to the measure; it is a property of the data arising from the measure.

What construct does the OPAL measure? According to the ISOQOL dictionary [19], HRQL is a term referring to the health aspects of quality of life, generally considered to reflect the impact of disease and treatment on disability and daily functioning. This is in contrast to quality of life which, using the ISOQOL dictionary definition from the World Health Organization, refers to an individual’s perception of their position in life in the context of the culture in which they live and in relation to their goals, expectations, standards and concerns. The OPAL measure would seem to measure older persons’ active living-related quality of life (OPALQOL).

Conclusion

The results of this study provide evidence that the 17-item OPAL measure is fit for the purpose of estimating the extent to which older persons are living actively at one point in time. Further work is needed to identify how the OPAL measure performs over time and as a way to assess an intervention, and whether shorter forms would provide similar evidence. Research among older persons from other cultures and languages would provide needed evidence as to the applicability of this active living construct globally.