Introduction

Self-reported service use data are used extensively in health services research because they provide comprehensive information on a variety of services and yet are relatively inexpensive to obtain.1,2 Self-reported data are particularly useful in behavioral health services research, where no single source of provider records can describe all of the services received because of the number and wide variety of provider types available for treatment. Persons with behavioral health problems use a variety of services not only in conventional clinical settings, but also in conjunction with welfare programs and, often, the criminal justice system.3 Typically, administrative records can only provide information about services received within one agency or organization, whereas a person's self-report ideally would include services received in all settings, from all providers. In an economic evaluation study taking a societal perspective, self-reported service use data could be more comprehensive and valid than information available from provider or insurer records, which may only represent service use from the perspective of the administrative party that maintains the data.4

The usefulness of self-reported service use data depends on the validity and reliability of the measurements.5 Much of the attention given to the psychometric properties of self-reported health service use has focused on validity. Validity studies generally compare self-reported service use against administrative records and show inconsistent findings: some report a favorable level of congruence between the two data sources,6,7 but others do not.8–10 Disagreements are often ascribed to the different ranges of services represented in the two data sources, errors in the administrative data such as incomplete recording, and recall bias in self-reporting.7–10 To compensate for problems with both sources and to produce comprehensive and valid utilization data, a hybrid method can be used: self-reported data are collected with a brief measure of provider contact, such as the Brief Health Services Questionnaire, and provider records are then retrieved for detailed health care use information.7,12 Even in this method, however, obtaining reliable answers from self-reporting is a prerequisite for the further collection of information from provider records.

Our study focuses on the reliability of self-reported service use. Assessment of the reliability (or consistency) of self-reported health service use requires repeated measures for each survey item (i.e., test–retest data), preferably by the same rater, within a reasonably narrow time window. Partly because of the lack of appropriate data, there is a paucity of evidence on the reliability of self-reported service use. Test–retest is a commonly used method in psychology and education research to assess the reliability of survey items, and in the health services literature it has frequently been applied to examine the reliability of self-reported health status.11,13

We identified eight studies that examined the reliability of self-reported health service use.14–21 All but one study21 were conducted on small samples of focused populations in particular geographic areas: six studies tested similar instruments designed to elicit responses from parents and children about which services the children used,14–20 and one study examined the use of typical medical services (inpatient, outpatient, and emergency room) among persons with schizophrenia.19 All but two of the previous studies used samples drawn from either medical or mental health service settings to ensure enough service use to make test–retest reliability results valid.14–18 The two studies that used community samples rather than clinical samples used limited measures: one examined any use of health services in a high school sample,20 and the other examined any use of preventive screening services in a small subsample from a national survey.21

Taken together, the existing studies on the reliability of self-reported service use consistently report substantial agreement for any use (yes/no) of specific service types, and fair to moderate agreement for the quantity of service within specific service types. By service type, reliability of reporting was higher for inpatient care than for outpatient-based services, and higher for aggregate service categories than for more specific ones. Self-reporting of more specific information, such as the frequency of outpatient services provided by mental health professionals, tended to be less reliable.

Regarding determinants of reliability, studies document that question factors, such as sentence complexity, recall period (the time between events and reporting), and service type, are more important than individual characteristics, such as age, gender, and ethnicity.16,18,20 None of the previous studies has examined the reliability of reporting on the specific content of services received or evaluated the impact of inconsistent reporting on evaluation study outcomes, both of which are explored in the present study.

The present study examines the reliability of self-reported service use among women with behavioral health problems and extends the population studied and the scope of analyses beyond those of previous studies. First, the study participants are from a population not studied before, recruited from nine sites nationwide representing diverse geographic areas. Second, this study examines instruments measuring diverse dimensions of service use, including any use as a binary variable, quantity of use among service users, and content of service in terms of the focus of treatment for each service type. Third, the survey instrument captures a comprehensive range of services, including typical inpatient and outpatient care, residential treatment, and jail or shelter use, sectors from which a substantial fraction of participants with behavioral health problems received services.3,22 The reliability of self-reporting for services received from these atypical sectors is unknown. Fourth, this study is the first to examine factors influencing consistent reporting and to explore the impact of inconsistent reporting on the robustness of overall cost estimates.

The specific research questions this study seeks to answer are: (1) What is the test–retest reliability of self-reporting on quantity and content of service use in a variety of settings where health care services are provided? (2) What are the determinants of the consistency of self-reported service use considering such factors as service type, level of service use, severity of psychiatric symptoms, study site, and demographic characteristics? and (3) How sensitive are cost estimates to inconsistency in the quantity of self-reported service use with repeated measures?

Methods

Data

The test and retest data used in this study come from the baseline survey of the Women, Co-occurring Disorders, and Violence Study (WCDVS), conducted in nine sites nationwide from 2001 to 2003. The study participants were women with psychiatric and substance abuse disorders and histories of interpersonal violence. The WCDVS is a quasi-experimental study with an intervention arm that provided comprehensive, integrated, trauma-informed, and consumer/survivor/recovering person-involved care, and a comparison arm that provided usual care at each of the nine sites. Other details of the WCDVS study design have been reported previously.23 Because the retest data were collected at baseline, before the intervention began, no intervention effect is expected in the present study.

The retest sample included 8% (n = 186) of all study participants at baseline (n = 2,729), with approximately equal numbers of participants from each study site and from the intervention and comparison arms. The retest participants were randomly selected and hence display characteristics similar to those of the other WCDVS participants (Table 1). The retest interview (retest hereafter) was conducted an average of 7 days (s.d. = 4.2; range = 2–35 days) after the initial interview (test hereafter), by the same interviewer and with the same set of survey items. The only exception was that the service use questions in the retest used the identical recall period as the test (i.e., the 3 months preceding the original baseline interview date). All survey questions were read aloud and answers recorded by interviewers during in-person interviews.

Table 1 Sample characteristicsa

The characteristics of the study sample, described in Table 1, are based on the test data. The sample consists of women aged 19–59 (mean age 37) and represents diverse racial groups, education levels, marital statuses, and insurance statuses. General psychological distress was measured using the Global Severity Index (GSI), the average score of the Brief Symptom Inventory (BSI),24 a 53-item self-report scale measuring nine psychiatric symptom dimensions (items range from 0 to 4, with higher scores indicating greater severity). The average GSI of 1.4 was quite high, reflecting the fact that all the women in our study sample had complex behavioral health problems.

Variables

Study participants were asked to report the frequency and content of services they received during the last 3 months in a variety of categories (Table 2). The survey instruments capture all services received and are not limited to services received at the participating study site. For each service type, respondents were asked whether they received any service; if so, they were asked about the frequency of service use. The frequency of emergency room visits and the number of days in inpatient (or overnight stay) facilities were elicited with open-ended questions. For counseling sessions and outpatient visits, frequency categories were used instead of open-ended numeric responses. Each frequency category was converted into a total number of visits or sessions during the 3 months for the analysis: "daily" was converted into five times a week, or 65 times over the 3 months; "a few or two to four times a week" into three times a week, or 39 times; and "two to three times a month" into 2.5 times a month, or 7.5 times. We also ran the analyses on the original categorical scale and found similar results, and thus present results based on the continuous scale throughout this paper. For all service types, respondents were also asked about the content of services received during each stay, visit, or session and could choose one or more of the relevant categories. Figure 1 demonstrates how quantity and content of services were measured and coded.
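To make the conversion rule concrete, the following sketch (ours, for illustration only; the category labels are hypothetical shorthand, not the actual WCDVS item codes) maps the categorical frequency responses onto counts over the 13-week (3-month) recall window:

# Convert categorical frequency responses into counts over the 3-month
# (13-week) recall period, following the conversion rules described above.
# Category labels are hypothetical shorthand, not WCDVS item codes.
FREQ_TO_COUNT = {
    "daily": 5 * 13,                  # treated as 5 times/week -> 65
    "2_to_4_times_per_week": 3 * 13,  # midpoint 3 times/week -> 39
    "2_to_3_times_per_month": 2.5 * 3,  # midpoint 2.5 times/month -> 7.5
    "none": 0,
}

def sessions_in_3_months(category):
    """Return the imputed number of sessions over the 3-month window."""
    return FREQ_TO_COUNT[category]

assert sessions_in_3_months("daily") == 65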

Figure 1 Examples of responses to WCDVS interview questions on service use

Table 2 Test–retest reliability of self-reported service on any use and quantity of service use

For the reliability of the quantity of service use, the ten service types examined are hospital, emergency room, detoxification, individual and group counseling at residential and outpatient facilities, outpatient medical visits, homeless or domestic violence shelters, and jail. The proportion reporting any use was generally high, ranging from 19% for hospital stays to 61% for outpatient medical visits during the last 3 months. The average number of hospital days was 1.9, and the average number of individual counseling sessions in a residential facility was 2.2. See Table 2 for the frequency and intensity of the other types of service use.

For the reliability of service content, the four content areas examined are physical health, mental health, substance abuse, and trauma, within each of five service types: individual and group counseling during residential stays, individual and group counseling during outpatient visits, and outpatient medical visits.

Analysis

Agreement between test and retest data on service use is indexed by Cohen's kappa statistic (k)25 for dichotomously coded services and by the intraclass correlation coefficient (ICC)26 for continuous-scaled measures of service use. Following the method proposed by Shrout,27 we interpret kappa and ICC values below 0.1 as no agreement; 0.1 to 0.39 as slight agreement; 0.4 to 0.59 as fair agreement; 0.6 to 0.79 as moderate agreement; and 0.8 and above as substantial agreement. The reliability of self-reporting is assessed by the magnitude rather than the statistical significance of these indices. Although k and ICC are by far the most widely used indices, they are not free from limitations: both are influenced by the variability of event frequency and could be upwardly biased for seldom-used services.28,29 However, this potential bias might not cause serious problems here, given that event frequencies in our sample were relatively high even for less frequently used services such as hospitalization (19%) and jail use (20%).
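For illustration only (this is our sketch, not the authors' code), the two indices can be computed as follows: kappa via scikit-learn, and the ICC hand-rolled as the one-way random-effects variant ICC(1,1), one common choice when each subject contributes two paired measurements (the paper does not specify which ICC form was used). The toy data are hypothetical.

import numpy as np
from sklearn.metrics import cohen_kappa_score

def icc_oneway(test, retest):
    """One-way random-effects ICC(1,1) for two measurements per subject:
    (MSB - MSW) / (MSB + (k - 1) * MSW), with k = 2 ratings."""
    x = np.column_stack([test, retest]).astype(float)
    n, k = x.shape
    subj_means = x.mean(axis=1)
    msb = k * np.sum((subj_means - x.mean()) ** 2) / (n - 1)     # between-subject
    msw = np.sum((x - subj_means[:, None]) ** 2) / (n * (k - 1))  # within-subject
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical any-use flags and visit counts at test and retest.
any_test, any_retest = [1, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1]
n_test, n_retest = [3, 0, 12, 2, 0, 5], [4, 0, 10, 0, 0, 5]

print(cohen_kappa_score(any_test, any_retest))  # kappa for any use
print(icc_oneway(n_test, n_retest))             # ICC for quantity of use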

Multivariate regressions are used in the analysis of determinants of consistent reporting. A logit model is used for the dependent variable indicating consistent reporting of any use of services in the test and retest data (1 = agreement), and a linear regression model is used for the agreement rate in quantity of use among users. Following a method similar to that used in the literature,6,30 this study defines the agreement rate as 1 − |(N_T − N_R) / (N_T + N_R)|, where N_T and N_R are the total numbers of visits or stays reported in the test and retest, respectively. The agreement rate ranges from 0 (no agreement) to 1 (perfect agreement). The number of observations for the any-use model (n = 1,820) is the number of respondents with valid answers for all covariates (n = 182) multiplied by the number of service types (n = 10). Respondents who reported a non-zero response in either the test or the retest for a given service type are used for the analysis of quantity of use (n = 725). Factors examined are service type, the time interval between test and retest, study site, the total number of visits or days across the 10 service types as a proxy for utilization level, GSI at the test interview, age, race, marital status, education level, and insurance type. Service type is coded with dummy variables, with hospital days as the reference category. Huber-White cluster-adjusted robust standard errors are used to correct for clustering within individuals across service types.
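As a concrete check on the definition, a minimal sketch of the agreement rate (our illustration): identical non-zero counts give 1, and a positive count paired with a zero gives 0. The logit with respondent-clustered Huber-White standard errors could be fit with, for example, statsmodels using cov_type='cluster'; that pairing is our reading, not code from the study.

def agreement_rate(n_test, n_retest):
    """Agreement rate = 1 - |(N_T - N_R) / (N_T + N_R)|.

    Defined for respondents who reported a non-zero count in at least
    one of the two interviews, so the denominator is positive."""
    n_t, n_r = float(n_test), float(n_retest)
    return 1.0 - abs(n_t - n_r) / (n_t + n_r)

print(agreement_rate(5, 5))  # 1.0: identical reports
print(agreement_rate(6, 4))  # 0.8: moderate disagreement
print(agreement_rate(3, 0))  # 0.0: use reported in only one interview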

Finally, we estimate the cost of each service type and the overall service cost using the test and retest data to examine whether cost estimates drawn from the two datasets differ because of inconsistency in reporting. The unit-cost estimate for each type of service comes from diverse sources and approximates the societal perspective, as described elsewhere.31 To draw statistical inferences about the difference between estimates from the test and retest data, standard errors are calculated by bootstrapping with 500 replications, sampling with replacement.
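A minimal sketch of such a bootstrap, under our assumption (not stated in the paper) that respondents are resampled as test–retest pairs:

import numpy as np

rng = np.random.default_rng(seed=0)

def bootstrap_se_mean_diff(cost_test, cost_retest, reps=500):
    """Bootstrap standard error of the mean test-retest cost difference.

    Respondents are resampled with replacement as paired observations;
    500 replications, matching the paper. Pairing is our assumption."""
    diffs = np.asarray(cost_test, dtype=float) - np.asarray(cost_retest, dtype=float)
    n = diffs.size
    boot_means = np.array(
        [rng.choice(diffs, size=n, replace=True).mean() for _ in range(reps)]
    )
    return boot_means.std(ddof=1)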

Results

Test–retest reliability of quantity of service use

The test–retest reliability of self-reporting on any use of each category of service is generally good. The levels of agreement are moderate to substantial across all service categories (k = 0.65–0.94), highest for jail days and lowest for outpatient medical visits (Table 2). Agreement on the quantity of service use is lower than agreement on any use, ranging from slight (ICC = 0.12) for outpatient medical visits to substantial (ICC = 0.93) for shelter days.

The reliability of the total number of days in inpatient facilities is substantial for all services except detoxification days (ICC = 0.64). Individuals may not distinguish services received through detoxification from those received in residential facilities. When these two categories are combined, the reliability of service quantity improves (ICC = 0.79; result not shown in the table) but remains at a moderate level.

The numbers of residential and outpatient counseling sessions and outpatient medical visits show slight to substantial agreement (ICC = 0.12–0.82). The reliability of reporting is higher for counseling services received in outpatient settings than for counseling services received during residential treatment, and is lowest for outpatient medical visits. When aggregated, the reliability for any outpatient visit and any residential counseling is moderate (ICC = 0.61 and 0.74, respectively). We repeated all the analyses for the two subgroups above and below the median GSI of 0.76 and found no noticeable difference between the groups.

Test–retest reliability of content of service use

The reliability of self-reported service content during residential or outpatient counseling and outpatient medical visits ranges from none to moderate (k = −0.06 to 0.79) (Table 3). Generally, the reliability of reporting on the content of services received during counseling sessions is higher for mental health (k = 0.56–0.77) or substance abuse (k = 0.52–0.75) than for trauma (k = 0.45–0.60) or physical health (k = −0.06 to 0.40). An exception is that for outpatient medical visits, physical health is more consistently reported (k = 0.63) than the other content areas (k = 0.12–0.59). The reliability of reporting on service content is higher for services received during outpatient visits (k = 0.13–0.77) than for those received during residential treatment (k = −0.06 to 0.75). Again, the reliability increases only slightly when aggregate categories (i.e., any residential counseling, any outpatient visit) are used.

Table 3 Patients' perceptions of service content: mental health/substance abuse/trauma

Determinants of the consistency of self-reporting

We find few observable factors that are associated with consistent reporting. For any use, counseling services and outpatient medical visits are less likely to be consistently reported than hospital use, after controlling for other relevant factors (Table 4). For quantity of use, only the number of outpatient medical visits is less consistently reported than hospital days. Consistency of reporting also varies across study sites, which may reflect differences in the research staff who conducted the interviews. White race (vs. other race) and some college education (vs. less than high school) are associated with more consistent reporting in quantity of use. None of the following factors affects consistency of reporting: level of mental distress (GSI), level of service use in aggregate, and time interval between test and retest.

Table 4 Predictors of consistency in reporting: any use and frequency of service usea,b

Robustness of cost estimates from test and retest data

The average total cost estimates from the test and retest data are $9,168 (s.d. = $10,128) and $8,883 (s.d. = $10,243), respectively (Table 5). The variances in costs are quite large, which is typical of cost data, particularly among high-end users of health services. Rather than excluding extremely high-cost users at an arbitrary cut-point, we used bootstrapped standard errors to address the relatively skewed distribution and small sample size. The mean difference in total cost ($285) is only 3.2% of the average total cost and is not statistically different from zero. By service type, hospital costs ($2,772; 30%) and residential treatment costs ($1,800; 20%) constitute the majority of total costs in the test data.

Table 5 Cost estimates in test and retest data

Discussion

This study adds to the literature on the reliability of self-reported service use by extending the population studied and the scope of analyses. Studies of children in the community14–18,20 and of persons with schizophrenia19 have shown substantial agreement in reporting any use of services and fair-to-moderate agreement in reporting the quantity of services. Consistent with those findings, this study shows moderate-to-substantial agreement for any use and slight-to-substantial agreement for quantity of services among women with behavioral health problems.

The wide variation in reliability by service type is notable. Quantity of service is more consistently reported for inpatient days than for outpatient visits, perhaps because an inpatient stay is a more salient episode and thus easier to remember than an outpatient visit. On the other hand, the quantity of counseling services is more consistently reported for services received during outpatient visits than for services received during residential treatment. The treatments received during a residential stay are complex enough that service recipients may find it difficult to discern specific treatment elements. It is also possible that the frequency of specific services received while staying in a residential facility is harder to remember than the frequency of visits to outpatient facilities, which require more effort and time to attend. We also found that reliability improves with aggregation of service categories, which suggests that a lower level of detail is easier to remember and report consistently.

A confounding factor that might have influenced the reliability of counseling services and medical visits is the wording of the question. Frequencies of these services were elicited with fixed categories describing the average frequency of use per week or per month during the previous 3 months, whereas open-ended questions about the total frequency during the previous 3 months were used for the other service types (see Fig. 1 for an example of each). The difference in reliability may therefore partly reflect the difference in question format (i.e., categorical vs. open-ended). Furthermore, because counseling and outpatient visits are high-frequency events, an inconsistency in the answer for one category may result in a large difference in the total frequency over the 3-month period. With this survey design, the variation in reliability attributable to different question formats could not be teased apart from the variation attributable to different service types.

This study provides novel evidence on the reliability of self-reported content of care received during counseling services and medical visits. The reliability of service content is generally lower than the reliability of service quantity and falls below the acceptable level (k < 0.4) for some categories. Of particular concern is the low reliability of reporting on service content during medical visits, the most common type of service for which self-reporting is relied upon in health services research. People with behavioral health problems receive a variety of services and therefore may have difficulty differentiating services focusing on behavioral health from those addressing comorbid physical health problems during medical visits.

We find no evidence of an association between the severity of psychiatric symptoms, measured by the GSI, and the consistency of reporting. This is consistent with the findings of other studies,6,33,34 which reported that the validity or reliability of self-reporting was not influenced by the severity of psychiatric conditions. On the other hand, the type of illness or symptomatology may influence reliable reporting because of cognitive deficits associated with some psychiatric conditions. We were not able to investigate variation across symptomatologies because of the limited sample and diagnostic information. Previous studies have shown that self-reported health behavior and service use among persons with severe mental illness or substance abuse problems are also reliable and valid,19,35–39 which suggests that cognitive deficits associated with psychiatric conditions have little influence on the reliability of reporting.

For both any use and quantity of service, outpatient medical visits are significantly less likely to be consistently reported than hospital days, which cautions against the wide use of self-reporting to measure the frequency of outpatient medical visits. These results are consistent with the literature on the determinants of the validity of self-reporting, which indicates that salient and well-defined (vs. ambiguous) events are reported more accurately.5

The overall findings on the determinants of consistent reporting suggest that factors associated with survey administration are more important than subject characteristics. Similar findings were reported in previous studies of health service use among children.18,20 They are also consistent with the literature indicating that task factors, such as question form, wording, and mode of administration, account for more of the variance in response accuracy than any other class of variables.32 On the other hand, it is noteworthy that a large proportion of the variation across repeated measures was not explained by the variables in the model, as indicated by the relatively low R-square (0.16). More detailed information on individuals and services, and the availability of different question forms, would help increase our understanding of the determinants of reliable measures of service use.

One important application of self-reported service data is in economic evaluation research. Our results show that although reliability varies across service types, the aggregated cost estimate for overall service use is robust across repeated measures. This robustness arises partly because the reliability of reported quantity of use is higher for the more intensive and costly services, such as hospital use, whereas the less consistently reported services, such as outpatient medical visits, constitute a small proportion of the total cost for this study population.

In interpreting our results, one should be careful in generalizing the findings to populations or settings different from ours. The reliability of self-reporting among our study participants (women with behavioral health problems) could differ from the reliability among other groups of behavioral health service users. Furthermore, kappa and ICC indices measured for different populations might not be directly comparable.

Based on the findings and limitations of this study, we suggest several areas for further research to better understand the reliability of self-reported health service use. First, future research should explore the relationship between question phrasing (e.g., open-ended vs. closed-ended) and the reliability of reported service quantity, and between service type specification (e.g., residential treatment vs. specific types of services during residential treatment) and the reliability of reported treatment content. Such research would help in developing survey instruments that elicit more reliable data on service quantity and content. Second, future studies might also consider aided recall to stimulate memory for specific events. For example, providing a motivation to remember, or contextual cues, may considerably improve the reliability and validity of recall, particularly among clients with complex treatments or with cognitive deficits, because the vicissitudes of memory are common to both the test and retest data. Similarly, a recall window shorter than the 3 months used in this study may improve the reliability of reporting. Third, examining the reliability of self-reported service use in other populations with different ranges or levels of services, such as clients with physical health problems or primary care clients, would help in assessing the generalizability of our findings. Finally, further study should examine the validity of self-reported service use data in populations similar to ours. A review of provider records and other objective and unobtrusive measures would be valuable for checking the validity of clients' self-reporting. Such evidence is essential for understanding the psychometric properties of self-reported data in populations similar to ours and would allow comparison of the validity of self-reported service use across diverse populations.

Implications for Behavioral Health

Although self-reported data are widely used in assessing health service use, evidence on the quality of the data, particularly on the reliability of reporting, is very limited. Findings of our study suggest that among individuals with behavioral health problems, self-reported health service use data are reliable in capturing the quantity of services received across a variety of service areas. However, self-reporting of treatment content in highly specified service categories (e.g., individual counseling during residential treatment) may not be reliable. Similarly, the low reliability of reported quantity and content of service during outpatient medical visits, the most common type of medical event, needs attention. To determine the quantity and content of service use during general medical visits, physician records may be a better alternative to participant responses.

Despite some lack of agreement in reports of the quantity and content of services, cost estimates did not vary across repeated measures. Self-reported service use data produce robust cost estimates in aggregate and have the unique advantage of encompassing comprehensive types of service use. Therefore, self-reported service use data can serve as a useful source of information for the economic evaluation of behavioral health service programs.

Our findings on determinants of consistent reporting suggest that the reliability of reporting varies widely by service type and may be improved with better measurement or administration methods, but may not be sensitive to respondent characteristics such as demographics and disease severity. However, more evidence from different survey instruments, study populations, and study settings is needed to generalize our findings to a broader behavioral health service context.