Introduction

In the past decades, research focusing on health outcomes measurement has experienced an enormous expansion [1]. Health-related quality of life (HRQL) is amongst the most important of these outcomes. A recent systematic review identified 1,275 different instruments measuring HRQL and other related outcomes by the year 2000 [2].

The definition of HRQL and related concepts like health status and perceived health, among others, has been disputed and elusive, resulting in no single concept being universally adopted [3–5]. In a recent review of 68 different HRQL models, Taillefer et al. [6] observed that about 4 out of 10 models did not provide a clear definition of the concept. When definitions were provided, they differed significantly in their content.

Fostering simplification, the Food and Drug Administration (FDA) has hence proposed the umbrella term patient reported outcomes (PRO): “a measurement of any aspect of a patient’s health status that comes directly from the patient (i.e., without the interpretation of the patient’s responses by a physician or anyone else)” [5, 7]. The term is not new in the field [8], and it is appealing. Rather than overcoming the problems of conceptualizing the constructs being measured, this approach simply avoids them by focusing on the source of information rather than on the content. At the same time, it emphasizes the genuine importance of the individual’s own perspective when making the evaluation.

Several rigorous and useful classifications of health outcome measures have been published previously [1, 3, 7, 9–20]. Many of them claim to have been specifically devised for HRQL. A current Guidance for industry by the Food and Drug Administration has been the first to address such a classification from the unique perspective of PRO. Nevertheless, concerns have been raised about its limited focus (clinical trials) [21] and utility, due to the number of criteria (by far the longest list) and their nature (including criteria such as the number of items or the frequency of administration, which are not well supported in the literature) [7]. Its most significant limitation, similar to that of other previous attempts in the literature, is the lack of an explicit link to any valid model of health outcomes, whether supported by empirical evidence of validity or not [4, 22].

A classification system linked to a conceptual model would represent a substantial improvement for identifying a candidate pool of PRO instruments for a given purpose, since it would facilitate a comprehensive view of measures (including the identification of areas where there are a number of measures, and areas where there is a lack of them). It would also facilitate the selection of PRO measures to be used in research, management, and, eventually, in clinical practice, if standard guidelines were provided along with the classification system [23].

In this paper, we present the development of a classification system of PRO instruments, based on a valid conceptual model of health outcomes, and we apply it to the most commonly used instruments. We also discuss the added value of our approach for a broad range of health professional users.

Methods

We aimed to develop a simple classification system based on the minimum possible set of relevant criteria. We reviewed previous classifications of PRO measures and identified different areas of classification. Starting with selected previous classifications [1, 3, 7, 11], we applied a snowball technique to identify additional references [24] (Table 1). Three concepts were consistently pointed out as important, even if using different wording: the construct (or measurement object), the population to be assessed (range and characteristics of the people to whom the instrument should be applied), and the measurement model (Table 1). These concepts are the independent non-hierarchical principles (axes) in our classification system, and are instrumental in answering key questions in the measurement process (what? whose? and how?) (Table 2a). Within each axis, categories were established which characterize each instrument in relation to that particular axis (cross-classification) [26].

Table 1 Criteria used for the classification of patient reported outcome measures
Table 2a A classification system for patient reported outcome measures: axis and categories

We then applied the classification to a selection of the most evaluated and used PROs, as identified in a previous systematic review [2]. In the absence of a ‘gold standard’, the identification of the constructs measured by an instrument is performed on the basis of the review of its content (content validity), further supported with evidence of its relationship with other related variables (construct validity) [27]. The assessment of the content of each instrument was based on the content of each item, the minimal units that form all PROs. Every item is a stimulus, in the form of a question, task or individual component in a scenario, which the individual is given in order to elicit a response [28]. It can thus be considered as an operational definition of the intended measurement object, and this was the basis for our classification of constructs of the selected instruments, using the definitions provided in the next section.

One of the authors (J.M.V.) classified all of the instruments using the previously defined axes and categories, and the final classification for each instrument was agreed by consensus among the authors (J.M.V., J.A.). To illustrate the use of the classification system across categories not covered by these instruments, we also applied it to other selected instruments.

Results

The classification system and its rationale

Construct

Construct is the range of characteristics (traits and states) measured by the instrument, its measurement object. Our classification of constructs relies on the model proposed by Wilson and Cleary [29], a well-established bio-psycho-social model for health outcomes [30] (Fig. 1). Sullivan et al. [31] tested the model in Dutch community-dwelling elders and did not find it completely satisfactory. In more recent years, though, strong empirical evidence has been obtained supporting its validity in a variety of contexts, including both the general population aged over 65 [32] as well as patients living with HIV [33] or suffering from coronary heart disease [34, 35], and, most importantly, with very different types of measures (including the SF-36 Health Survey, the Nottingham Health Profile, Health Assessment Questionnaire-Disability Index (HAQ-DI), and the MacNew Heart Disease Quality of Life Questionnaire, among others) [32–35].

Fig. 1 An integrated model for health outcomes. Modified from [29] and [46]

While the model of Wilson and Cleary is the foundation of our classification proposal, this model can, to a considerable extent, be integrated with the theoretical model underpinning the International Classification of Functioning, Disability and Health (ICF). This ambitious classification system of health states has been proposed by the World Health Organization, and is based on a sociological perspective of health that considers disability along the whole functioning continuum [36, 37]. The two models were conceived independently, yet they share significant characteristics. They both differentiate health related variables and contextual factors, further splitting the latter into environmental and individual characteristics (Fig. 1). Biological and physiological variables in the model by Wilson and Cleary correspond to the structure component of the ICF, and functional variables in the first model likewise correspond to the activities and participation components of the latter [38]. Successful mappings, both current and previous, of responses to PRO measures onto the ICF support the validity of this integration [39, 40].

Based on the model by Wilson and Cleary, we differentiated and defined the following concepts: symptom status, functional status, health perceptions, and health related quality of life (Box 1). Although the original formulation of the model considered the more general concept of overall quality of life, all empirical evidence has been obtained for health related quality of life [33–35], and this was the construct finally included. In addition, the original model also considered biological or physiological variables, but the patient is then usually not the preferred source of information, and so this category has not been included in this classification system for patient reported outcome instruments.

Box 1 Definitions of relevant terms used in the classification system [29]

We also considered some other health related constructs that were not specified in either the Wilson and Cleary model or the ICF classification system, most notably satisfaction with health care (the extent of an individual’s experience with health care compared to his/her expectations) [43]. Although the construct satisfaction with care may have been used less extensively, its well-described nature as a health outcome and its widespread use support its inclusion in the classification system [44], as is also the case for resilience (ability to cope with or withstand stress and illness) [18, 43]. We have therefore included an additional category for “Other Health Related Constructs” in our model (Fig. 1, Table 2a).

Previous classification systems have relied on ad hoc constructed lists for symptoms and diseases. In order to achieve better standardization, we propose to classify symptoms relying on the implied codes in the International Classification for Diseases ICD-10, 2nd version [45] (Table 2b). Similarly, functional status can be specified according to ICF chapters (Table 2c) [46].

Table 2b Specific categories for the construct “symptoms” and for the category “disease” in the axis “population”
Table 2c Specific categories for the construct “functional status”

Population

The population of a PRO measure is the universe of persons for which the instrument is suited. It is defined in terms of age and gender, presenting diseases (if any) and culture (Table 1), all of them undisputedly relevant to the characterization of the patient from a clinical, epidemiological, and organizational point of view. Further, the important health differences in subpopulations defined according to these criteria underlie the rationale for the development of subpopulation specific instruments [47]. We will not address here the concept of culture [9], but for the purposes of this classification only, we conceptualize culture as the dyad of language and country of the population for which the instrument has been devised.

Measurement model

Three issues are of utmost relevance to the measurements elicited by a PRO instrument (and therefore, to their interpretation): the theoretical model that sustains the metric of the instrument, the level of aggregation of the score (dimensionality), and the extent to which the instrument can be tailored to each individual (adaptability).

Metric refers to the method used to assign numeric values to the responses given by the individuals and the construction of the scores. Three broad groups of instruments can be distinguished: psychometric, econometric, and clinimetric [1, 3]. Scoring algorithms of psychometric instruments are broadly based on the sum of item responses for each scale, either weighted or not [48]. The main differences between psychometric and clinimetric instruments arise from the methods used in scale development, with the former building upon theoretical models (i.e., sample domain theory) and using sophisticated statistical methods [46], and the latter focusing almost exclusively on clinical relevance [49, 50]. These instruments are best suited for ordering individuals along a continuum for their comparison in clinical trials, monitoring or, to a lesser extent, screening and/or diagnosis of patients.

Econometric measures have come about due to the need to assess and value health states as separate entities. They aim to obtain values based on health state preferences (of patients, populations, experts, etc.), using methods from the field of econometrics, based on decision theory. These preferences are known as ‘utilities’, and the measuring instruments are called utility or preference-based measures. Utilities can be associated with an appropriate time interval in order to calculate the quality-adjusted life years (QALY) index [1]. Here, the proposal becomes an important aid to the use of PRO measures, since it has been clearly pointed out that some uses (e.g., cost-effectiveness analyses) are challenged when psychometric instruments are used [51]. Since other approaches might also be possible [47], a category for ‘other metrics’ was included in the proposal.
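The way utilities combine with time to yield QALYs can be illustrated with a minimal sketch. It assumes the conventional anchoring of utilities (0 for death, 1 for full health) and deliberately ignores discounting and other refinements; the function name `qalys` and the example values are hypothetical and are not taken from any instrument discussed here.

```python
def qalys(health_states):
    """Sum utility x duration over a sequence of health states.

    health_states: iterable of (utility, years) pairs, where utility is a
    preference weight (assumed anchored at 0 = dead, 1 = full health) and
    years is the time spent in that state. No discounting is applied.
    """
    total = 0.0
    for utility, years in health_states:
        if years < 0:
            raise ValueError("time spent in a health state cannot be negative")
        total += utility * years
    return total


# Hypothetical example: two years at utility 0.8, then one year at 0.5.
example = [(0.8, 2.0), (0.5, 1.0)]
print(qalys(example))
```

In this hypothetical case the three calendar years are valued at fewer than three quality-adjusted years, because both states carry a utility below 1.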

Dimensionality, on the other hand, refers to the number of scores produced for each individual. When the information of the instrument can be summarized in a single value we refer to an index. Beyond obvious cases of instruments consisting of a single item (indicators), such as a single question concerning self-perception of general health [53] or a visual analogue pain scale, all unidimensional instruments produce index type scores. Many disease specific psychometric questionnaires, such as the Beck Depression Inventory, as well as most econometric measures are index-type measures. When more than one score is needed we refer to a profile. These categories are not exclusive: some instruments produce only index scores, others produce only profile scores, and yet others produce both.
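The distinction between profile and index scores can be sketched for a psychometric instrument scored by (optionally weighted) sums of item responses per scale. This is only an illustration: the item and scale names are hypothetical, and taking the index as the sum of the scale scores is an assumption made here for simplicity, not the scoring rule of any particular instrument.

```python
def score_instrument(responses, scales, weights=None):
    """Return (profile, index) for one respondent.

    responses: dict mapping item id -> numeric response
    scales:    dict mapping scale name -> list of item ids in that scale
    weights:   optional dict mapping item id -> weight (default 1.0)

    profile is one score per scale (weighted sum of its items);
    index is a single summary value, assumed here to be the sum of the
    scale scores.
    """
    weights = weights or {}
    profile = {
        scale: sum(weights.get(item, 1.0) * responses[item] for item in items)
        for scale, items in scales.items()
    }
    index = sum(profile.values())
    return profile, index


# Hypothetical three-item questionnaire with two scales.
responses = {"q1": 2, "q2": 3, "q3": 1}
scales = {"physical": ["q1", "q2"], "emotional": ["q3"]}
profile, index = score_instrument(responses, scales)
print(profile, index)
```

An instrument classified as "profile" would report only the per-scale values, one classified as "index" only the single summary value, and one classified under both categories would report the two together.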

At the beginning of the outcomes research movement, collections of different instruments, called batteries, were very popular [1]. They are not considered in the classification system because the focus is on individual instruments rather than their eventual combinations.

Adaptability is the third concept relevant to the measurement methods. By this, we refer to the extent to which the instrument can be tailored to the specific circumstances and preferences of each individual [51]. Most of the PRO questionnaires are completely standardized: they include explicitly formulated questions and predefined response options. There are obvious advantages inherent in such a high degree of standardization which explain the success of this type of instrument, most notably the simplification of procedures, the reliability of the estimates, and the comparability of the results obtained.

This approach, though, has been criticized for not taking the perspective of each individual patient into consideration, but rather that of the “average” patient or some other abstract subject [54, 55]. More flexible instruments have been developed, usually referred to as “patient-generated”, “patient-centered”, or “individualized” measures. In these instruments, domains and/or weights are not fixed. Each individual subject elicits them, indicating, for example, which activities or problems they would like to select for assessment. These measures might offer clear advantages over standardized instruments in clinical settings, where patient-centeredness is more an issue than standardization. Some instruments include a mix of both approaches and can be conceptualized as partially individualized.

Item banking and Computer Adaptive Testing (CAT) procedures allow reaching a high precision of measurement with shorter formats rather than increasing the sensitivity to preferences of the individual, and they should be classified as standardized [56].
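The axes and categories described above can be represented as a simple record per instrument, which in turn makes it straightforward to filter a candidate pool. The following is a minimal sketch under stated assumptions: the class and field names, the free-text population fields, and the two example instruments are all hypothetical, and the category lists only restate those named in the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Construct(Enum):
    SYMPTOM_STATUS = "symptom status"
    FUNCTIONAL_STATUS = "functional status"
    HEALTH_PERCEPTIONS = "health perceptions"
    HRQL = "health related quality of life"
    OTHER = "other health related constructs"


class Metric(Enum):
    PSYCHOMETRIC = "psychometric"
    ECONOMETRIC = "econometric"
    CLINIMETRIC = "clinimetric"
    OTHER = "other metrics"


class Dimensionality(Enum):
    INDEX = "index"
    PROFILE = "profile"
    BOTH = "index and profile"


class Adaptability(Enum):
    STANDARDIZED = "completely standardized"
    PARTIAL = "partially individualized"
    INDIVIDUALIZED = "individualized"


@dataclass(frozen=True)
class Population:
    age: str                 # free text in this sketch, e.g. "adults"
    gender: str              # e.g. "all"
    disease: Optional[str]   # None when not restricted to a disease
    language: str            # culture is modelled as language + country
    country: str


@dataclass(frozen=True)
class Instrument:
    name: str
    constructs: FrozenSet[Construct]   # an instrument may measure several
    population: Population
    metric: Metric
    dimensionality: Dimensionality
    adaptability: Adaptability


def candidate_pool(instruments, construct, metric):
    """Keep instruments that measure `construct` and use `metric`."""
    return [
        inst for inst in instruments
        if construct in inst.constructs and inst.metric is metric
    ]


# Two purely hypothetical instruments, for illustration only.
adults = Population("adults", "all", None, "English", "United Kingdom")
instrument_a = Instrument(
    "Hypothetical A",
    frozenset({Construct.SYMPTOM_STATUS, Construct.FUNCTIONAL_STATUS}),
    adults, Metric.PSYCHOMETRIC, Dimensionality.PROFILE,
    Adaptability.STANDARDIZED,
)
instrument_b = Instrument(
    "Hypothetical B",
    frozenset({Construct.FUNCTIONAL_STATUS, Construct.HEALTH_PERCEPTIONS}),
    adults, Metric.ECONOMETRIC, Dimensionality.INDEX,
    Adaptability.STANDARDIZED,
)

pool = candidate_pool(
    [instrument_a, instrument_b],
    Construct.FUNCTIONAL_STATUS,
    Metric.PSYCHOMETRIC,
)
print([inst.name for inst in pool])
```

Storing the constructs as a set reflects the cross-classification idea: an instrument is characterized on every axis at once rather than being placed in a single branch of a hierarchy.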

Application of the classification system to frequently used instruments

In order to exemplify the use and applicability of the classification system we applied it to the ten most evaluated PRO measures [2]. We were able to classify all the instruments across all the axes and categories. All of them measured the constructs Symptoms and Functional Status. Most of them actually measured at least one additional construct, usually Health Perceptions, and two measured four different constructs. We present some examples of this analysis in Box 1.

Eight instruments were applicable to all adults, and two to adults with certain diseases only, but none of them was designed to measure reported outcomes in children. The majority were psychometric instruments (seven) and all of them were completely standardized (Table 3). For the purpose of exemplification, Table 3 also presents five additional instruments covering categories not applicable to the first 10.

Table 3 Applying the classification system to the selected patient reported outcome measures

Discussion

What is really new about our proposed PRO classification system

Firstly, previous classification schemes considered either a simple list of examples of what constructs were implied without further elaboration [5, 20] or a list of the different features of the instruments without consideration of their underlying relationships [7]. Furthermore, previous schemes were not explicitly based on current conceptual models of health outcomes [22]. We have used an explicit methodology for the development of the classification system, and we have relied on a conceptual model that has proved valid and useful [37].

Secondly, this is a simplified classification system. While our proposal is based on attributes consistently used in the literature (the measurement object, the target population, and the measurement model) we do not consider it necessary to differentiate PRO measures by characteristics which do not fundamentally affect the nature of the instrument, such as different administration modes, weighting procedures, or use of full and reduced versions, among others [7].

Thirdly, it is important to note that ours is only a descriptive classification system and it does not provide any fundamental evaluation of the measurement properties of the instruments. We consider such evaluation crucial for adequate selection and interpretation of PROs. In fact, we have recently developed a standardized tool for the evaluation of such measures [55]. Classification and evaluation systems are complementary and should be used in tandem. Our approach results in at least three clear advantages over previous classifications [7, 13]: (1) less information is needed for the use of the system; (2) increased stability of the classification across different versions of the instruments [47, 48]; and (3) it can be applied from the very beginning in the development of the instrument.

Fourthly, previous attempts reduced differences between measures to their generic or specific nature, not considering any further systematic approach as to what is the intended population or to what disease or symptom it should be applied [7, 13]. Our proposal takes full advantage of the worldwide endorsement of the International Classification of Diseases ICD-10 of the WHO, both for the construct of Symptoms and for instruments that are ‘disease specific’, and of the International Classification of Functioning for the construct Functional Status. In light of our observation that most instruments (including those that focus on specific diseases) measure more than one construct, the traditional division into generic and specific measures seems very imprecise. Furthermore, all the criteria defining the populations are relevant from a clinical and a health services provision point of view [1].

Limitations

Apart from metric properties, a number of characteristics not included here may be of interest for describing a PRO instrument, such as time for completion, availability of interpretation guidelines, or the degree of patient involvement in the generation of the items. Classification systems aim to reduce to a minimum the information needed to identify an object, and we have limited our classification system to only three fundamental characteristics. But additional information may be relevant when choosing a PRO instrument for use in clinical practice or research. Even the detail in which the fundamental characteristics are considered may seem insufficient. Potential users may also be interested in whether a given instrument considers a particular symptom (e.g., pain). Our emphasis on simplicity may have compromised the amount of information available for each instrument in the system.

For the axis “construct”, our classification system relies on the health outcomes model proposed by Wilson and Cleary [29]. A number of different models have been proposed [37, 57], but they lack both the widespread use and the empirical evidence that supports the Wilson and Cleary model. However, this endorsement is contingent on available evidence. Should other models be tested and proved valid, this may result in a need for considering different constructs.

The inclusion of constructs other than those included in the Wilson and Cleary model has been justified on theoretical grounds, but the available evidence supporting the model may not necessarily apply to them. Research using available techniques such as structural equation modeling will be needed to confirm their inclusion [31–35]. In particular, we used a slightly modified version of the model by Wilson and Cleary, substituting the original “overall quality of life” with the more specific “health related quality of life”. We did so as supported by empirical data, but one of the reviewed instruments was found to measure the former rather than the latter (see Box 1).

Finally, the work presented here represents only an initial step towards the development and adoption of a common classification system of PRO measures. The application of this system to a database including about 400 PRO instruments will be our next step [58]. This process will allow us to test the classification system with a greater variety of instruments and will provide invaluable information about the generalizability of the method.

How to use the classification system

A common framework for classifying PRO measures will serve different purposes for a broad range of health professionals. Researchers, clinicians, administrators and policy makers are all confronted with decisions based on these measures in their everyday work. All of them have now been provided with a solid guide to a better understanding of the nature of the instruments and to the interpretation of the related literature. Moreover, the system provides the basic information for identifying the candidate pool of PRO instruments available for use. Clinicians will thereby also be assisted in the selection of PRO instruments for their clinical practice [59, 60]. Administrators and policy makers are currently encouraged to integrate outcomes with existing process measures in order to get the most comprehensive view of the performance of health services [61]. Researchers, finally, will find this proposal a useful tool for the identification of areas lacking instruments and other areas where there might be a surplus of instruments [47].

Once the potential user is aware of the basic information describing what the instrument is designed to measure, a necessary next step is to compare the evidence that supports the robustness and adequacy of each candidate instrument. A number of guidelines including attributes and criteria of adequacy exist for evaluating these characteristics [7, 35]. The use in tandem of our proposed classification and the standard evaluation guidelines should facilitate the adequate use of PRO instruments in clinical research, management, and practice.

An additional intended consequence is to foster discussion for a common definition of the constructs [3]. As a matter of fact, the application of our proposal to well-known instruments has revealed that they are much more heterogeneous in their nature than had been claimed by their developers. The time has come to make the efforts conducive to the construction and adoption of a common terminology [22].