Introduction

Social participation is an indicator of an individual’s integration in their community and is an important component of quality of life. Studies have shown that people who are well integrated in a group or community are buffered to some degree from the impact of daily stressors and non-routine crises that characterize human life [18]. This research suggests that social participation may have a legitimate place in health promotion and public policy, especially in relation to mental health [9]. However, to confidently establish the importance of social participation for health, and to identify which aspects of social participation are most critical, further research on the psychometric properties of existing social participation measures is necessary.

This study aims to examine the internal validity of a modified version of the Social Participation Index (SPI) [2] using Rasch analysis. The modified measure, the Social Participation Questionnaire (SPQ), was tested to assess how well responses fit the expectations of the measurement model. The analysis evaluates the validity of the category scoring system and the fit of individual items, and assesses the potential bias of items by response dependence, gender and age.

Method

The SPI was developed for use in The Adelaide Health Development and Social Capital Study [2]. Items in the SPI were selected by Baum et al. [2] on the basis of face validity and their knowledge of the social capital literature. They adopted a broad interpretation of social participation, and the SPI includes 29 items from the following six categories: informal social contact (three items; e.g. visiting family members); social contact through activities in public spaces (four items; e.g. a visit to a café or restaurant); participation in group activities such as sports or hobbies (six items; e.g. gym or exercise class); participation in civic activity undertaken on an individual basis (seven items; e.g. signing a petition or writing to a politician); civic participation involving group activity (four items; e.g. trade union or residents action group); and participation in community groups that involve a mix of social and civic purposes (five items; e.g. service clubs). The SPI has not undergone psychometric testing.

In the current study, the SPI was modified by removing two categories and adding four new items. Participation in civic activity undertaken on an individual basis and civic participation involving group activity were both removed from the questionnaire because earlier work by Baum and her colleagues found that endorsement for these categories was very low and that they had little association with mental health [2, 10]. Two items (Gone to a club, pub or bar and Gone to watch a sports event) were added to the section on social contact through activities in public spaces (Table 1), and one item was added to the section on participation in group activities (Had social contact through children’s sport). Finally, one item asking about using the internet for social communication was added to the category informal social contact. These additional items were drawn from a list of social activities collated by the Australian Bureau of Statistics and were considered to capture social activities frequently undertaken by Australian adults [11].

Table 1 Final fit of the revised questionnaire items to the Rasch analysis

The modified 22-item version of the questionnaire was called the SPQ (Table 1). Eighteen items in the SPQ asked respondents to indicate how regularly they had undertaken each activity in the last 12 months. They could choose one of the six response categories: “Never”, “Rarely”, “A few times a year”, “Monthly”, “A few times a month”, and “Once a week or more”. The remaining four items requested a binary “Yes”/“No” response regarding participation in community groups that involved a mix of social and civic purposes in the last 12 months.

Participants were 789 adults (226 males and 563 females) who (1) agreed to participate in an observational cohort study of depression in general practice [12] and (2) recorded a score of 16 or more on the Center for Epidemiologic Studies Depression Scale (CES-D) [13] in an initial screening survey. The CES-D is a widely used, short, self-report scale designed to measure depressive symptoms. The cut point of 16 provides the best discriminative ability to detect significant depressive symptoms in both community and patient samples [14]. At least 2 weeks after the screening survey, participants completed a self-administered questionnaire which included the SPQ. Because the SPQ items were intended to be summed to provide a total score, Rasch analysis (partial credit model) was applied to assess the psychometric properties of the scale [15].

Analytical methods

Data were fitted to the Rasch model using RUMM2030 software [16]. When data adequately fit the Rasch model, simple comparisons of the items and respondents are possible. The estimates of the item parameters were assumed to be independent of the distribution of the participants being surveyed; however, the power of the test of fit may be affected by Rasch reliability, measured through the person separation index [15].

The overall fit statistics were considered. Both the item–person and the item–trait interaction statistics were examined: the former was transformed to approximate a z score, and the latter was reported as a Chi-square, which reflected the property of invariance across the trait. Because the number of response categories varied across items, the item thresholds were not equal, and therefore, the partial credit parameterization was used to test the data [17].
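As an illustration of the partial credit parameterization, the following minimal sketch computes category probabilities for a single item from a person location and a set of item thresholds. It is not the RUMM2030 implementation; the function name and all numerical values are hypothetical and chosen only to show how items with different numbers of categories are handled by the same formula.

```python
import math

def pcm_category_probabilities(theta, thresholds):
    """Return P(X = 0..m) for one item under the partial credit model.

    theta: person location in logits.
    thresholds: list of m threshold locations (logits) for this item.
    """
    # Cumulative sums of (theta - tau_k); category 0 has an empty sum of 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical dichotomous item (one threshold).
binary_probs = pcm_category_probabilities(0.0, [0.0])
# Hypothetical six-category item (five thresholds).
six_cat_probs = pcm_category_probabilities(-0.5, [-2.0, -1.0, 0.0, 1.0, 2.0])
```

Each item carries its own list of thresholds, so a binary “Yes”/“No” item and a six-category frequency item can be analysed together.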

Results

Overall fit

As the residual misfit for individual persons may skew the analysis due to their aberrant response patterns, misfitting persons were examined. There were only two respondents with a fit residual outside of the range ±2.5, and all results were confirmed by excluding them from the analysis.

Initial inspection of the scale showed poor overall fit to the Rasch model, as evident in the standardized item fit residual statistic (mean = −0.555, SD = 1.857) and the item–trait interaction statistic (χ² = 473.447, df = 198, p < 0.001). This suggests that the relative operation of the items was not consistent across all levels of the underlying trait and that there was misfit between the data and the model [18]. The poor overall fit was confirmed by inspection of the individual item fit statistics, which revealed misfitting items, and of the item characteristic curves (Fig. 1).

Fig. 1 Item characteristic curve for item 11

Thresholds

The pattern of thresholds was examined to assess whether disordering might affect the fit. Initial inspection showed that the thresholds of most items were disordered. Exceptions included the four dichotomous items and items 7 and 8. Disordered thresholds occur when respondents have difficulty consistently discriminating between response options. This can occur when there are too many response options, or when the labelling of options is potentially confusing or open to misinterpretation [19].

For items with disordered thresholds, the category and threshold probability curves were inspected (Figs. 2, 3). For a well-fitting item, it would be expected that across the entire range of the trait, each response category would systematically take turns showing the highest probability of endorsement. Inspection of the category and threshold probability curves for all 22 items confirmed that none of the thresholds operated as required [18]. The threshold and category probability curves for items 10 (Gone to watch a sports event), 11 (Gone to a party or dance) and 12 (Played sport) suggested that respondents were not able to reliably distinguish the response categories “Never” and “Rarely”.

Fig. 2 Category probability curve for item 4 before rescoring showing disordered thresholds

Fig. 3 Threshold probability curve for item 4 before rescoring where thresholds did not discriminate

Disordered thresholds can be remedied by combining adjacent categories and retesting the fit of the data to the Rasch model. Although excessive collapsing is not recommended [20], it can be justified where discrimination at a pair of thresholds is close to zero and the threshold estimates are reversed from their natural order [21].
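The check-and-collapse step can be sketched as follows. This is only an illustration: the threshold estimates and the recoding map are hypothetical (the map merges “Never” with “Rarely”, the pair that respondents could not distinguish on some items), and the actual rescoring applied to each item is the one reported in Table 1.

```python
def thresholds_are_ordered(thresholds):
    """True when each threshold is strictly higher than the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))

def rescore(responses, mapping):
    """Recode raw category scores using a dict such as {0: 0, 1: 0, 2: 1}."""
    return [mapping[r] for r in responses]

# Hypothetical threshold estimates (logits) with the first two reversed.
hypothetical_thresholds = [0.4, -0.3, 0.1, 0.9, 1.5]
is_ordered = thresholds_are_ordered(hypothetical_thresholds)

# Hypothetical map merging "Never" (0) with "Rarely" (1); later codes shift down.
merge_never_rarely = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
rescored = rescore([0, 1, 2, 5, 3, 1], merge_never_rarely)
```

After rescoring, the model would be refitted and the thresholds checked again, one item at a time.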

Items 1–18 were rescored one at a time (Table 1). After a series of separate analyses, this procedure improved the overall model fit, with a change in the item–trait interaction probability value from 0.000 to 0.272. Although this represents a change to the original format of the questionnaire, it more closely represents the actual response patterns of the respondents in this sample. The improvement was confirmed by inspecting the threshold probability curves (Fig. 4), which showed that rescoring improved the functioning of the item and produced well-discriminating thresholds.

Fig. 4 Threshold probability curve for item 4 after rescoring where thresholds operated successfully

The rescoring of these items suggests that the questionnaire may potentially be modified by reducing all response categories to a binary format. It is possible that “Yes”/“No” response categories for the question: ‘In the past 12 months, have you done any of the following activities’ might be more valid than using a multiple response format. Exceptions apply to five items: Visited family/family visit; Visited friends/friends visit; Gone to watch a sports event; Gone to a party or dance; and Played sport, where more specific response categories such as “Monthly or more”/“Less than monthly” would be more appropriate.

Overall fit after items rescoring

Rescoring improved the overall model fit, with a change in the residual standard deviation value for items from 1.857 to 1.043 and a change in the item–trait interaction total Chi-square probability value from 0.00 to 0.27. The relative operation of the items was consistent across all levels of the underlying trait, and overall the data fit the model [18]. An estimate of the internal consistency reliability of the scale based on the person separation index (PSI = 0.72) indicated that the questionnaire had reasonable person separation reliability. The analysis therefore had reasonable power to detect misfit, and the PSI indicated that the items measured an underlying (or latent) construct and that the scale is appropriate for group use [22].
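One common formulation of the person separation index is the proportion of observed variance in person estimates that is not attributable to measurement error. The sketch below assumes that formulation and uses the sample variance; the person locations and standard errors are hypothetical, and RUMM2030 may compute the index with different conventions.

```python
from statistics import mean, variance

def person_separation_index(locations, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    observed_variance = variance(locations)  # sample variance of person estimates
    error_variance = mean(se ** 2 for se in standard_errors)
    return (observed_variance - error_variance) / observed_variance

# Hypothetical person locations (logits) and their standard errors.
hypothetical_locations = [-1.5, -0.5, 0.0, 0.5, 1.5]
hypothetical_errors = [0.6, 0.5, 0.5, 0.5, 0.6]
psi = person_separation_index(hypothetical_locations, hypothetical_errors)
```

With these hypothetical values the index is about 0.76; larger measurement errors relative to the spread of persons pull it towards zero.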

The residual mean value for persons was −0.282 with a SD of 0.739, indicating no misfit among the respondents in the sample and higher discrimination than expected by the model. The residual mean value for items was −0.433 with a SD of 1.043, close to the expected value of 1, suggesting no misfitting items. Alternative recoding procedures were also checked; however, no other solution improved the value of the fit residual standard deviation.

Finally, for the purpose of cross-validation, the model was confirmed using data provided by the same people who completed the SPQ 12 months later. This sample comprised 542 respondents (155 males and 387 females). The item–trait interaction statistic was χ² = 207.070, df = 176, p = 0.05, and the residual mean value for items was −0.378 with a standard deviation of 1.202. The fit residual mean value for persons was −0.287 with a standard deviation of 0.692; the PSI was 0.71.

Targeting for Social Participation Questionnaire

It is important, particularly in clinical practice, that measures are appropriately targeted to the population being assessed. Poorly targeted measures may result in floor or ceiling effects. Figure 5 shows the frequency distribution of responses along the Rasch calibrated metric scale of social participation which provides an indication of how well targeted the items are for the study population. The left panel displays the frequency distribution of respondents along the Rasch calibrated metric scale of the attribute being measured (social participation). Cases at the bottom of the display have the lowest location value, representing low levels of social participation, while cases at the top have a high magnitude of social participation.

Fig. 5 Person item map showing distribution of persons for revised SPQ

The right side of the graph displays the positions of each of the items. Items positioned at the bottom of the display are the easiest items to endorse. In this case, the most commonly endorsed item was item 7 (Gone to a café or restaurant) while the least commonly endorsed was item 21 (Ethnic group). The lower endorsement of the Ethnic group item can be seen in the light of the overall sample characteristics in which only 17 % were born in a country other than Australia. While attendance at an ethnic club is not restricted to people born outside Australia, it is specifically targeted to this group. In contrast, going to a café or restaurant is an activity that is equally (with the possible exception of financial disadvantage) available to all participants.

The scale shows a small gap in the graph between item 5 (Gone to a social club) and items 3 (Visited neighbours/had neighbours visit), 8 (Gone to a club, pub or bar), and 9 (Gone to the cinema or theatre) indicating that the questionnaire could benefit from inclusion of further items to measure lower level (informal) social participation of the group as a whole.

Overall, the mean person location value of −0.548 suggested that the questionnaire was reasonably well targeted (not too easy, not too hard) for use with this group. However, it was noted that some items were slightly less likely to be endorsed.

Individual item fit after items rescoring

Following the recoding of items, inspection of the individual item fit statistics (a summation of individual item deviations) revealed no misfitting items (Table 1), confirming the overall fit of the model. All items showed fit residual values within the range of −2.5 to +2.5 and, with one exception, Chi-square probability values above 0.01, suggesting that the data fit the model. The exception was item 5 (Gone to a social club), which showed a Chi-square probability value of 0.002. However, after Bonferroni correction of the alpha value of 0.01, the Chi-square fit statistic for this item no longer indicated significant deviation from the model. A Bonferroni adjustment divides the alpha value used to assess statistical significance by the number of items in the scale; in this case, the adjusted alpha value was 0.01/22 = 0.000455 [23].
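The adjustment described above can be reproduced in a few lines. The function name is illustrative; the alpha value, number of items and the probability value for item 5 are those reported in the text.

```python
def bonferroni_alpha(alpha, n_tests):
    """Divide the nominal alpha by the number of tests (here, items)."""
    return alpha / n_tests

adjusted_alpha = bonferroni_alpha(0.01, 22)   # approximately 0.000455
item5_p = 0.002                               # Chi-square probability for item 5
item5_misfits = item5_p < adjusted_alpha      # False: no significant misfit
```

Because 0.002 is larger than the adjusted alpha, item 5 is not flagged once the correction is applied.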

Differential item functioning

Having reached a satisfactory fit to the Rasch model with the rescored items of the SPQ, assessment of differential item functioning (DIF) was conducted using both statistical and graphical procedures. Results from a previous study [2] indicated that demographic variables such as gender and age were significantly associated with social participation levels, and therefore, DIF analysis was conducted to compare item location with respect to gender (male/female) and age (18–34/35–54/55–76).

According to the DIF analysis, two items showed significant DIF by gender after Bonferroni adjustment of the 0.05 alpha value (Table 2). As no item showed non-uniform DIF (i.e. there was no interaction between the class intervals and gender), uniform DIF for items 5 and 8 (Figs. 6, 7) was interpreted directly [24]. Females were less likely than males to endorse these items. Because the difference was significant (p < 0.001), these items were calibrated separately for males and females, with each treated as two separate scale items [25].

Table 2 Uniform and non-uniform DIF statistics for gender

Fig. 6 Differential item functioning graph of females and males for item 5

Fig. 7 Differential item functioning graph of females and males for item 8

Because sequentially resolving items helps to identify artificial DIF induced in other items, this procedure was conducted for gender [25]. Because items 5 and 8 had similar mean squares, each one was resolved first in turn. Item 8 had the slightly higher mean square, so it was split first. After item 8 was resolved, DIF analysis using a Bonferroni-adjusted p value of 0.05 still showed significant uniform DIF for item 5 (F(1, 788) = 13.91, p < 0.0001).

Similarly, when item 5 was resolved first, DIF analysis using a Bonferroni-adjusted p value of 0.05 still showed significant uniform DIF for item 8 (F(1, 788) = 14.47, p < 0.0001). After items 5 and 8 were both split, no item showed uniform or non-uniform DIF (using Bonferroni-adjusted p values of both 0.01 and 0.05). The logit difference between males and females was 0.7 for the split item 5 and 0.9 for the split item 8. However, splitting items 5 and 8 did not improve the overall model fit. Overall, females were less likely to participate in social activities such as going to a club, pub, bar or social club. This is consistent with a previous study showing more social contact through activities in public spaces for males [2].
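Resolving (splitting) an item means replacing it with group-specific copies so that each group receives its own item estimate. The sketch below shows only the data preparation step; the helper name, group codes and responses are hypothetical, and `None` stands for a structurally missing response.

```python
def split_item_by_group(responses, groups, target_group):
    """Return two columns: one holding responses for target_group only,
    the other holding responses for everyone else; the rest are missing."""
    in_group = [r if g == target_group else None
                for r, g in zip(responses, groups)]
    out_group = [r if g != target_group else None
                 for r, g in zip(responses, groups)]
    return in_group, out_group

# Hypothetical responses to one item and the respondents' gender codes.
hypothetical_responses = [3, 1, 2, 0]
hypothetical_gender = ["F", "M", "F", "M"]
item_female, item_male = split_item_by_group(
    hypothetical_responses, hypothetical_gender, "F"
)
```

The two resulting columns would then be calibrated as separate items, and the difference between their locations gives the size of the DIF in logits.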

In terms of age differences (Table 3), older participants were less likely to endorse SPQ items than those in younger age groups (Fig. 8). As no item showed non-uniform DIF (no interaction between the class intervals and age), uniform DIF for these items was interpreted directly.

Table 3 Uniform and non-uniform DIF statistics for age

Fig. 8 Differential item functioning graph for different age groups for item 11

To detect potential artificial DIF, a procedure of sequentially resolving items was conducted [25]. As item 11 had the highest mean square, it was split first and the overall model fit and individual item statistics were checked. After item 11 was resolved, no item showed uniform or non-uniform DIF (using Bonferroni-adjusted p values of both 0.01 and 0.05). The logit difference for resolved item 11 between participants in the oldest age group (55–75 years) and participants in the youngest age group (18–34 years) was almost one logit (0.9). The split items procedure resulted in a slight improvement in the overall model fit, with the PSI increasing from 0.72 to 0.74. Overall, older participants were less likely to participate in social activities such as attending a party or a dance, which is consistent with previous research showing that social contact through activities in public spaces decreased as age increased [2].

Unidimensionality

A principal component analysis was conducted on the residuals to identify the two most divergent subsets of items [26]. Items may load positively or negatively upon the first principal component, and those items with the highest loading in either direction may represent sets of items that are the most likely to breach the assumption of local independence. Examination of the item residuals matrix revealed no correlations greater than ±0.3, and the eigenvalue of 1.932 for the first component was only slightly larger than the eigenvalues for the other components [27]. The first principal component explained only 8.787 % of the total variance among residuals, which suggested a small possibility that a second component existed [28].
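The link between the eigenvalue and the percentage of variance can be checked with simple arithmetic. The sketch assumes the analysis was run on the correlation matrix of the 22 item residuals, so that the eigenvalues sum to the number of items; that assumption is not stated in the text.

```python
n_items = 22              # items in the SPQ
first_eigenvalue = 1.932  # eigenvalue of the first residual component

# With a correlation matrix, total variance equals the number of items.
share_of_variance = first_eigenvalue / n_items
percent_of_variance = 100 * share_of_variance
```

This gives roughly 8.78 %, in line with the reported 8.787 % once rounding of the eigenvalue is allowed for.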

Further analysis of the pattern of residuals showed that the residuals loaded in opposite directions on the first principal component (Table 4). This showed that these items have a common feature in addition to the factor structure that is common across all the items [26]. A clear distinction between the two subsets was that items in the positively loaded set represented informal social activities, while items in the negatively loaded set represented, with the exception of visiting one’s neighbours, formal group participation. Participation in the informal activities may or may not involve other people, whereas the activities in the latter set are designed around a group format.

Table 4 Principal component analysis of the residuals showing loadings on the first component extracted

These two subsets of items were then separately fitted to the Rasch model and the person estimates obtained. The differences in person estimates derived from these analyses were trivial. The difference between the first subset and the overall scale was 0.07 logit, and the difference between the second subset and the overall scale was 0.03 logit. The reliability for the SPQ obtained from the unidimensional approach was much higher than reliabilities calculated from the two-dimensional approach.

A paired samples t test (at the 5 % level of significance) was conducted to test whether person estimates derived from the two subsets of items were significantly different from each other. Corrected for errors of measurement, the correlation between the two subtests was of the order of 0.36 and 0.35, and there was no statistically significant difference in the person estimates derived from the two subsets (set1: M = −0.557, SD = 1.297; set2: M = −0.504, SD = 1.256, M diff = −0.053, SD = 1.449; t(789) = −1.027). These results indicate that the assumption of unidimensionality was not breached [22].
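The reported t statistic can be recovered from the summary values alone. The sketch below uses the mean difference, the standard deviation of the differences and the sample size given in the text; the function name is illustrative.

```python
import math

def paired_t_from_summary(mean_diff, sd_diff, n):
    """t = mean difference / (SD of differences / sqrt(n))."""
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error

# Values reported for the two item subsets.
t_stat = paired_t_from_summary(-0.053, 1.449, 789)
```

The result is approximately −1.03, well inside the ±1.96 range used at the 5 % level.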

The person ability estimates based on the subsets of items were then generated and compared using a series of independent t tests (one for each case). Although the person locations estimated from these two subtests differed significantly for 6.5 % of participants at 5 % level of significance, the 95 % confidence intervals around this estimate, derived from a binomial distribution, included five per cent (CI: 4.8–8.4) which indicated that the unidimensionality of the 22-item SPQ was supported [26].
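A rough check of this interval can be made with the normal approximation to the binomial, as sketched below. This is an assumption for illustration only: the interval reported in the text may have been derived with an exact method, so the upper bound differs slightly.

```python
import math

def normal_approx_ci(proportion, n, z=1.96):
    """Approximate 95 % confidence interval for a binomial proportion."""
    standard_error = math.sqrt(proportion * (1 - proportion) / n)
    return proportion - z * standard_error, proportion + z * standard_error

# 6.5 % of the 789 person-level t tests were significant.
lower, upper = normal_approx_ci(0.065, 789)
includes_five_percent = lower <= 0.05 <= upper
```

The approximate interval runs from about 4.8 % to 8.2 % and contains 5 %, which is the condition used to support unidimensionality.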

Response dependence

The matrix of correlations from the standardized residuals showed that there was only one correlation that exceeded a value of 0.2 between items. These items were 10 (Gone to watch sports event) and 13 (Had social contact through children’s sport). This correlation might indicate some form of response dependence.

Item 13 was closer to the mean of the person distribution (−0.548), while item 10 was slightly more difficult (Table 1). Therefore, item 10 was resolved on the basis of the response to item 13, and the estimated magnitude of response dependence between items 10 and 13, d, was calculated [29]. The estimate of d was highly significant at a conservative 1 % level of significance (critical value 2.58; z = 0.613/0.045 = 13.699). However, the magnitude of the difficulty caused by dependence was of the order of only 0.6 logits. Participants who scored 0 (“Never”) on item 13 (Had social contact through children’s sport) found item 10 (Gone to watch sports event) more difficult to endorse; however, the item was only 0.6 logits more difficult than it would have been without dependence. The result of the analysis was confirmed using the complete data set.
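The significance test for the dependence estimate is a ratio of the estimate to its standard error. The sketch below uses the rounded values printed in the text, so the ratio comes out slightly lower than the reported z, which was presumably computed from unrounded values.

```python
d_estimate = 0.613        # estimated magnitude of dependence (logits), as reported
d_standard_error = 0.045  # standard error, as reported (rounded)

z = d_estimate / d_standard_error
significant_at_1_percent = abs(z) > 2.58
```

Either way, the statistic is far above the 2.58 critical value, while the effect itself remains about 0.6 logits.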

Conclusion

The current study undertook a rigorous examination of the SPQ. In an Australian sample of primary care patients with depressive symptoms, the SPQ appears to be a promising measure of social participation. However, the questionnaire could benefit from the inclusion of further items to measure lower levels of social participation. Rasch analysis revealed several valuable insights into the psychometric properties of the SPQ. The SPQ shows promising internal consistency and reliability, which supports its use in studies including people with depressive symptoms.

Overall, the questionnaire was reasonably well targeted for use among adults with depressive symptoms, although the sample as a whole found some items slightly harder to endorse than others. Rasch analysis showed a single, unidimensional structure of the questionnaire. A rigorous test of the local independence assumption demonstrated a very small magnitude of difficulty caused by dependence. Cross-validation of the Rasch-derived SPQ using the 12-month follow-up sample supported its utility and structural validity.

Further validation studies of the revised SPQ in the same cohort of adult patients with depressive symptoms will be conducted in the future to see whether additional items make it more appropriate for use with this group. These modifications should target the spectrum of informal social contact and include items such as: Gone for a walk with friends/family, Gone shopping with friends/family, Gone to museum or gallery with friends/family, or Talked to friends or family on the phone. In addition, because the initial scoring was not consistent with the model, modification of the questionnaire format could be trialled by simplifying all response categories to a binary format. More straightforward response labels such as “Yes”/“No” or “Monthly or more”/“Less than monthly” could be examined. It is possible that “Yes”/“No” response categories for the question: ‘In the past 12 months, have you done any of the following activities’ might be more valid than using a multiple response format. Exceptions apply to five items: Visited family/family visit; Visited friends/friends visit; Gone to watch a sports event; Gone to a party or dance; and Played sport, where more specific response categories such as “Monthly or more”/“Less than monthly” would be more appropriate. Reducing the number of categories and simplifying their response labels will also decrease the burden on respondents and make interpretation of results easier. As the possibility of gender and age influencing responses to some of the items was demonstrated, the SPQ will also be tested to assess category ordering and the presence of DIF, particularly in relation to age differences across clinical groups.

Facilitating social activities among adults is a promising direction for programs intended to promote mental health. Further validation of the questionnaire with additional items using the same sample will assess in depth the pattern of social participation in patients with depressive symptoms and might help in defining determinants of social isolation (limited social activity).