Background

Voice problems or dysphonia can be defined as any deviation in voice quality, pitch or loudness, inappropriate for an individual’s age, gender or cultural background. Dysphonia can result from alterations in respiratory, laryngeal or vocal tract mechanisms (organic dysphonia), improper or inefficient use of the vocal mechanism (functional dysphonia) or psychological stressors (psychogenic dysphonia) [1,2,3]. Voice problems can manifest in different voice modalities (e.g. speaking voice, singing voice and shouting voice) [4]. The burden of dysphonia, its impact on quality of life and work-related effects are increasingly recognised [5].

The prevalence of dysphonia in the general population has been estimated at 0.98% [5]; however, prevalence rates are highly dependent on variables such as gender, age or occupational factors. The lifetime prevalence of a voice disorder may be as high as 29.9% [6], with even higher risks in people for whom using their voice is critical to their vocation, such as teachers [7]. Prevalence data also differ because of variations in the instruments used for measurement. Most researchers and clinicians agree on the fact that voice is a multidimensional phenomenon and follow the guidelines for functional assessment of voice pathology laid out by the Committee on Phoniatrics of the European Laryngological Society [8]. In these guidelines, a multidimensional set of minimal basic measurements for all ‘common’ dysphonias is proposed, involving five different approaches: perception, videostroboscopy, acoustics, aerodynamics and subjective rating by the patient [8]. Still, having reached a consensus on approaches does not imply an agreement on the measures to use in the assessment protocol for voice pathology. Most importantly, the use of a measure in research or clinical practice can only be justified by having robust psychometric properties: reliability, validity and its discriminative and evaluative ability [9].

Subjective ratings in persons with dysphonia include self-report questionnaires on health-related quality of life (HR-QoL) and functional health status (FHS). HR-QoL is a term that specifically refers to health-related aspects of quality of life, generally considered to reflect the impact of disease and treatment on disability and daily functioning [10]. HR-QoL is the unique personal perception an individual has of his or her health, taking into account social, functional and psychological factors [11]. HR-QoL, however, differs from quality of life as quality of life is a broader construct that encompasses more aspects than just health; that is, quality of life refers to an individual’s perception of their position in life in the context of the culture in which they live and in relation to their goals, expectations, standards and concerns [12, 13]. FHS, in turn, refers to the influence of a given disease on particular functional aspects [11]. Function is an umbrella term encompassing all body functions, activities and participation [14]. The health state of an individual at a particular point in time may be modified by functional states, impairments, perceptions and social opportunities that are influenced by disease, injury, treatment and health policy [15]. Still, the distinction between the HR-QoL and FHS concepts can become blurred in measurement. For instance, many self-report questionnaires in dysphonia include items or subscales related to both FHS and HR-QoL. As such, distinction between these concepts is often not possible when using these questionnaires, even though reduced HR-QoL and impaired FHS may require different management or intervention.

To select appropriate measures from the available self-report questionnaires, the psychometric properties of each questionnaire must be evaluated and compared. Clinicians and researchers need to carefully select measures with optimal psychometric properties to ensure adequate quality and appropriate interpretation of results. Measures lacking robust psychometric quality cannot guarantee sufficient reliability and validity of retrieved results, thus opposing evidence-based clinical practice and research. The COSMIN group (COnsensus-based Standards for the selection of health Measurement INstruments) established an international consensus-based taxonomy, terminology and definitions of measurement properties for health-related patient-reported outcomes [16]. The framework comprises nine measurement properties subsumed within three domains: reliability, validity and responsiveness. In addition, the COSMIN checklist was developed, providing a standardised and validated tool to rate the methodological quality of studies describing the psychometric properties of self-reported measures in health [17]. The COSMIN framework and checklist have been used in over 560 psychometric reviews (see website: database.cosmin.nl/) and are grounded in contemporary literature, thus representing the most appropriate methodology to address the psychometric properties of self-report questionnaires in dysphonia.

A simplified checklist to operationalise measurement characteristics of patient-reported outcome measures was developed for reviewers, researchers and clinicians with varied expertise in psychometrics and/or clinimetrics [18]. However, the authors’ criteria stand in contrast to the COSMIN checklist, which is a complex tool that requires users to have expertise in psychometrics. Furthermore, the simplified checklist shows several methodological shortcomings and received robust critique from the COSMIN group [19]. First, the methodological quality of studies should be distinguished from the effect sizes in trials and separated from the quality of the patient-reported measure itself. This is not the case for the simplified checklist. The results of studies with insufficient methodological quality may be biased. Therefore, in line with Cochrane methodology, the methodological quality of studies on measurement properties needs to be rated before rating study results; this is the precise purpose of the COSMIN checklist [19]. In addition, because of its simplicity, the evaluative criteria of the simplified checklist do not provide sufficient detail for unbiased and systematic rating of the quality of the measure [18]. For example, criteria are lacking on what constitutes good content validity, dimensionality or responsiveness. As this checklist was developed for users with limited methodological background, a lack of clarity and standardisation in rating introduces bias in judging what constitutes good measurement properties [19].

A psychometric review on voice-related patient-reported outcome measures using the aforementioned simplified checklist was recently published [20]. Due to methodological shortcomings inherent to this checklist, the psychometric properties of self-reported questionnaires in dysphonia remain unclear. No other psychometric reviews in the field of dysphonia have been published.

Study aim

This systematic review aimed to identify all current self-report questionnaires on FHS and/or HR-QoL in dysphonia for adult populations, and to evaluate the psychometric properties of these questionnaires using the COSMIN framework and checklist.

Methods

The PRISMA statement [21] and the COSMIN framework and checklist [16, 17] guided the methodology and reporting of this systematic review. This review consists of three consecutive steps: (1) performing a systematic literature search; (2) rating the methodological quality of studies reporting on psychometric properties using the COSMIN checklist [17]; and (3) rating the quality of each measurement property for all questionnaires using pre-defined criteria [22, 23].

Eligibility criteria

Self-evaluation questionnaires on FHS and/or HR-QoL in dysphonia, as well as research articles and manuals reporting on the psychometric properties of FHS and/or HR-QoL questionnaires were considered for inclusion in this review. Only questionnaires developed and published in English and research articles and manuals written in English were eligible for inclusion. Questionnaires targeting adults with dysphonia were included, but questionnaires focussing on vocal training were excluded from this review as these measures target a different population. Persons with dysphonia have voice disorders, whereas professional voice users mainly aim at voice optimisation [24]. Single-item questionnaires were excluded as FHS and HR-QoL are complex constructs that cannot be captured by single items only. Furthermore, questionnaires that were not a comprehensive measure were also excluded; for measures to be considered comprehensive, they needed to produce an overall or summative score that was calculated based on the reporting of a number of individual items that collectively comprise the construct of dysphonia.

A minimum of 50% of all items within a questionnaire were required to target the measurement of FHS and/or HR-QoL in dysphonia for the measure to be included. Conference abstracts, reviews, student dissertations and editorials were not considered for inclusion.

Literature searches and study selection

Systematic literature searches were performed in two different databases: Embase and PubMed. First, databases were searched for self-evaluation questionnaires on FHS and/or HR-QoL in dysphonia (see Supplementary Table 1). Next, additional searches were conducted to identify publications on the psychometric properties of the retrieved questionnaires (see Supplementary Table 2). The final searches were conducted in November 2016; all publications that met the eligibility criteria and were published before this date were included. Two independent reviewers performed the abstract and article selection process. Discrepancies in abstract selection were resolved by consensus between both reviewers. Differences in the final selection of questionnaires or research articles were resolved by group consensus.

Methodological quality assessment of studies on psychometric properties

To evaluate the methodological quality of the selected studies on psychometric properties, the COSMIN taxonomy of measurement properties and definitions for health-related patient-reported outcomes was used [16]. The COSMIN framework comprises nine measurement properties: internal consistency, reliability (including test–retest, inter-rater and intra-rater reliability), measurement error, content validity (including face validity), structural validity, hypothesis testing, cross-cultural validity, criterion validity and responsiveness. Table 1 presents definitions for each measurement property used for this review, as guided by the COSMIN statement [16]. Interpretability is not considered to be a psychometric property within the COSMIN framework and was therefore excluded from this review. Responsiveness was outside the scope of this review, and as only original English questionnaires were included, cross-cultural validity was not evaluated. Criterion validity could not be assessed due to the lack of a ‘gold standard’ measure in the field of FHS and HR-QoL in dysphonia.

Table 1 Definitions of measurement properties for health-related patient-reported outcomes instruments according to COSMIN [16]

The COSMIN checklist [17] is a standardised tool and was used to rate the methodological quality of the studies describing the psychometric properties of the included questionnaires. Each measurement property is rated individually, and the checklist for each measurement property contains 5–18 items rated on a four-point scale (poor, fair, good, excellent). The items rate the quality of study design and the robustness of statistical analyses performed in studies on the domains of reliability, validity and responsiveness. When using a ‘worst rating counts’ system, the final quality rating for a measurement property is equivalent to the lowest rating given to any of the items contained in the checklist for that property [25]. As this method impedes the detection of subtle differences in methodological quality between studies, a revised scoring procedure was developed [26,27,28], and this is the method used in this review. Final quality ratings for measurement properties are presented as a percentage using the following formula:

$${\text{Total}}~{\text{score}}~{\text{per}}~{\text{psychometric}}~{\text{property}}=~\frac{{\left( {{\text{Total}}\,{\text{score}}\,{\text{obtained}} - {\text{Min}}~{\text{score}}~{\text{possible}}} \right)}}{{\left( {{\text{Max}}~{\text{score}}~{\text{possible}} - {\text{Min}}~{\text{score}}~{\text{possible}}} \right)}} \times 100\%.$$

The total percentage score is then categorised as poor (0–25%), fair (25.1–50%), good (50.1–75%) or excellent (75.1–100%). Two independent raters with expertise in COSMIN scoring completed all ratings. To ensure consistency in scoring, a random selection of 40% of all articles retrieved was rated by both raters. The inter-rater reliability was determined by calculating the weighted Kappa between raters.
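The revised scoring procedure and the categorisation described above can be illustrated with a minimal sketch. The numeric coding of the four-point scale (poor = 1 to excellent = 4), the function names and the example ratings are assumptions made for illustration only; they are not taken from the COSMIN checklist or from the studies in this review.

```python
from typing import Sequence

# Assumed numeric coding of the four-point item scale (hypothetical).
ITEM_SCORES = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}


def percentage_score(item_ratings: Sequence[str]) -> float:
    """Rescale the summed item ratings of one property to 0-100%."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    obtained = sum(ITEM_SCORES[rating] for rating in item_ratings)
    min_possible = len(item_ratings) * min(ITEM_SCORES.values())
    max_possible = len(item_ratings) * max(ITEM_SCORES.values())
    return (obtained - min_possible) / (max_possible - min_possible) * 100.0


def categorise(percentage: float) -> str:
    """Map a percentage to the bands used in this review (upper bound inclusive)."""
    if not 0.0 <= percentage <= 100.0:
        raise ValueError("percentage must lie between 0 and 100")
    if percentage <= 25.0:
        return "poor"
    if percentage <= 50.0:
        return "fair"
    if percentage <= 75.0:
        return "good"
    return "excellent"


def worst_rating_counts(item_ratings: Sequence[str]) -> str:
    """Return the lowest item rating, as in the 'worst rating counts' system."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    return min(item_ratings, key=lambda rating: ITEM_SCORES[rating])


# Illustrative ratings for a five-item checklist (not study data).
example = ["excellent", "good", "good", "fair", "excellent"]
pct = percentage_score(example)
print(f"{pct:.1f}% -> {categorise(pct)}; worst rating counts -> {worst_rating_counts(example)}")
```

In this illustrative case the percentage score is 73.3% (good), whereas the ‘worst rating counts’ system would assign a rating of fair, which shows how the revised procedure preserves differences that the lowest-item rule hides.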

Quality of measurement properties

Once the methodological quality of the included studies was determined, the quality of the measurement properties was evaluated. Research articles that received a poor COSMIN rating were excluded from further analysis. To address the quality of the measurement properties of each questionnaire, psychometric data were retrieved from the selected research studies and were rated according to pre-defined quality criteria per measurement property [22, 23] (see Supplementary Table 3). Measurement properties could receive a positive, negative, or indeterminate rating. In cases of methodological issues, such as problems in study design or statistical analyses, ratings were classified as indeterminate.

Overall quality of psychometric properties

Finally, an overall quality score for each measurement property evaluated for each assessment was determined using criteria [23]; these levels of evidence combine the COSMIN ratings for assessing the methodological quality of studies on psychometric properties, and the corresponding quality assessment of psychometric data retrieved from these studies. As a result, an overall quality rating per psychometric property for each questionnaire can be obtained.

Results

Systematic literature search

The first systematic literature searches identified self-evaluation questionnaires on FHS and/or HR-QoL related to dysphonia. After deletion of duplicates, a total of 2214 abstracts from Embase (1118 records) and PubMed (1487 records) were identified. Figure 1 presents the flow diagram according to PRISMA [29]. A total of 67 questionnaires were assessed for eligibility, resulting in 15 questionnaires meeting all inclusion criteria. Supplementary table 4 provides a list of the 52 excluded measures and reasons for exclusion.

Fig. 1 Flow diagram of the reviewing process according to PRISMA

Additional searches were conducted to retrieve publications on the psychometric properties of the included questionnaires, resulting in a total of 937 abstracts (excluding duplicates): 334 records from Embase and 731 records from PubMed. Data on psychometric properties were retrieved from the literature for all questionnaires. Forty-eight articles reported on one or more psychometric properties of any of the 15 questionnaires on FHS and/or HR-QoL in dysphonia. No manuals of questionnaires were located.

Measures of FHS and HR-QoL in dysphonia

The following 15 questionnaires on FHS and/or HR-QoL were identified: Evaluation of the Ability to Sing Easily (EASE) [30], Glottal Function Index (GFI) [31], Singing Voice Handicap Index (SVHI) [32], Singing Voice Handicap Index-10 (SVHI-10) [33], Transgender Self-Evaluation Questionnaire (TSEQ) [34], Transsexual Voice Questionnaire—Male to Female (TVQMtF) [35], Vocal Fatigue Index (VFI) [36], Vocal Performance Questionnaire (VPQ) [37], Voice Capabilities Questionnaire (VCQ) [38], Voice Disability Coping Questionnaire (VDCQ) [39], Voice Handicap Index (VHI or VHI-30) [40], Voice Handicap Index-10 (VHI-10) [41], Voice Rating Scale (VRS) [42], Voice-Related Quality of Life (V-RQOL) [43] and Voice Symptom Scale (VoiSS) [44]. Nine questionnaires combined items on HR-QoL and FHS (EASE, SVHI, SVHI-10, TSEQ, TVQMtF, VHI, VHI-10, V-RQOL and VoiSS), and six questionnaires mainly focussed on FHS (GFI, VFI, VPQ, VCQ and VRS) or HR-QoL (VDCQ). All fifteen questionnaires targeted persons with dysphonia; of these, three questionnaires specifically targeted the singing voice (EASE, SVHI and SVHI-10) and two questionnaires the transgender population (TSEQ and TVQMtF). Two questionnaires were shorter versions of the original questionnaires: the SVHI-10 and the VHI-10 were shorter versions of the SVHI and VHI, respectively.

Details on the 48 studies on the development and validation of the included questionnaires on FHS and/or HR-QoL in dysphonia are summarised in Supplementary Table 5. Supplementary Table 6 summarises the characteristics of all 15 questionnaires, including names and number of subscales, number of items and response options. Eight questionnaires have no subscales, five questionnaires have three subscales (EASE, TSEQ, VFI, VHI, VoiSS), one questionnaire has two subscales (V-RQOL) and one questionnaire has four subscales (VDCQ). No cut-off scores are used in any of the included questionnaires (for example, to distinguish between normal voice and dysphonia). All but one questionnaire use a Likert response scale; only the VRS uses visual analogue scales. The total number of items varies between 4 and 36.

Methodological quality assessment

The COSMIN checklist [17] was used to assess the methodological quality of the 48 included studies. Supplementary Table 7 presents an overview of all COSMIN ratings. Studies that described the psychometric properties of more than one questionnaire were rated multiple times, for each questionnaire separately. Only two studies received poor COSMIN ratings [45, 46] of which one study received a poor rating for one of the analyses (hypothesis testing) [46]. Data resulting from analyses with poor COSMIN ratings were excluded from further analysis, thus leaving 47 studies. All remaining studies were rated as having sufficient methodological quality for further analysis. All studies but four reported on hypothesis testing. Limited information was retrieved on internal consistency (18 studies), reliability (9 studies, mainly intra-rater reliability), content validity (12 studies) and structural validity (9 studies). No data were identified on measurement error. The inter-rater reliability between both COSMIN raters was very good: weighted Kappa 0.93 (95% CI 0.84–1.00).
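For readers who wish to compute an agreement statistic of this kind, the following minimal sketch calculates a weighted Cohen's Kappa for two raters on an ordinal scale. The weighting scheme is not specified in this review; linear weights are assumed here, with quadratic weights available as an option. The function name and the example ratings are hypothetical and do not reproduce the study data.

```python
from typing import Sequence


def weighted_kappa(
    rater_a: Sequence[int],
    rater_b: Sequence[int],
    n_categories: int,
    quadratic: bool = False,
) -> float:
    """Weighted Cohen's kappa for ordinal categories coded 0..n_categories-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    if n_categories < 2:
        raise ValueError("at least two categories are required")
    n = len(rater_a)
    k = n_categories

    # Observed joint proportions: rows for rater A, columns for rater B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n

    # Marginal proportions per rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i: int, j: int) -> float:
        # Disagreement weight: 0 on the diagonal, 1 for the most distant categories.
        d = abs(i - j) / (k - 1)
        return d * d if quadratic else d

    observed_dis = sum(
        disagreement(i, j) * observed[i][j] for i in range(k) for j in range(k)
    )
    expected_dis = sum(
        disagreement(i, j) * row[i] * col[j] for i in range(k) for j in range(k)
    )
    if expected_dis == 0.0:
        raise ValueError("kappa is undefined when no disagreement is expected by chance")
    return 1.0 - observed_dis / expected_dis


# Illustrative ratings on a four-point scale (0 = poor ... 3 = excellent).
print(weighted_kappa([0, 1, 2, 3], [0, 1, 2, 2], n_categories=4))
```

Weighted Kappa gives partial credit for near agreement (for example, good versus excellent), which suits an ordinal rating scale better than unweighted Kappa.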

Quality of measurement properties of assessments

Supplementary Table 8 presents the quality of the psychometric properties retrieved from 47 included research articles for all 15 questionnaires based on pre-defined quality criteria [22, 23]. Details on rating criteria are summarised in Supplementary Table 3. The overall, integrated quality score for each measurement property per questionnaire was determined using the criteria or levels of evidence [23], and is presented in Table 2. The overall level of psychometric quality is determined by integrating the methodological quality ratings of the included studies using the COSMIN checklist (Supplementary Table 7) with the quality criteria for measurement properties of the questionnaires [22, 23] (Supplementary Table 8).

Table 2 Overall quality score per measurement property per questionnaire

None of the measures reported psychometric data on all six measurement properties. In particular, data on measurement error were lacking for all fifteen measures. In total, 42% (38/90) of all six psychometric properties for all fifteen measures were not reported on and 32.7% (17/52) of scores that were reported on were classified as indeterminate ratings. All but two measures (VCQ and VRS) showed positive overall quality scores for at least one psychometric property, whereas four measures (SVHI, SVHI-10, VFI and VoiSS) received a negative overall quality score for a single measurement property. Two measures showed conflicting psychometric data for a single property (VHI-10 and V-RQOL).
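As a check on the proportions reported above, 15 questionnaires evaluated on six measurement properties give 90 possible overall quality scores, of which 38 were not reported on and 52 were reported on:

$$15 \times 6 = 90,\qquad \frac{38}{90} \approx 42.2\%,\qquad 90 - 38 = 52,\qquad \frac{17}{52} \approx 32.7\%.$$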

Discussion

The purpose of this systematic review was to identify self-report questionnaires measuring FHS and/or HR-QoL related to dysphonia for adult populations, and to determine the quality of their psychometric properties according to the COSMIN taxonomy.

Findings on psychometric properties

This review identified 15 questionnaires and 48 studies describing at least one psychometric property of one or more of the included questionnaires. No manuals were retrieved. Of those studies with sufficient methodological quality (47 studies), 12 studies determined psychometric properties of more than one questionnaire, including hypothesis testing describing associations between two of the included questionnaires. The number of psychometric properties per questionnaire addressed in each study was limited. Most studies (31 of 47; 66%) addressed a single psychometric property; however, for 10 of the 15 questionnaires, four or more psychometric properties had been evaluated across studies. Furthermore, 47% (48 of 102) of all quality ratings on psychometric properties retrieved from the 47 studies were classified as indeterminate, which resulted in 33% (17 of 52) of the overall quality scores per psychometric property per questionnaire being classified as indeterminate. Therefore, when describing the psychometric characteristics of FHS and/or HR-QoL questionnaires in dysphonia, many data in the literature are lacking or remain unclear due to methodological or statistical flaws in the identified psychometric studies. As a consequence, the findings from this systematic review provide an incomplete psychometric overview, and the generalisability and interpretation of results remain limited.

For two questionnaires, only data on a single psychometric characteristic were retrieved, and for another three questionnaires data were found on two characteristics. For six questionnaires data were reported on four psychometric characteristics, and for four questionnaires data were reported on five characteristics. Hypothesis testing was most frequently determined (13 of 15), followed by internal consistency (12 of 15), reliability (10 of 15), structural validity (9 of 15) and content validity (8 of 15). For all but two questionnaires [35, 38], data were retrieved for at least one aspect of validity (content validity, structural validity or hypothesis testing). No data were identified on measurement error for any of the questionnaires. Responsiveness was out of the scope of this review; cross-cultural and criterion validity were also not determined as only questionnaires developed and published in English were included and no ‘gold standard’ instrument for FHS and/or HR-QoL in dysphonia was identified.

Based on the available psychometric data for the 15 included questionnaires and excluding those questionnaires with negative (SVHI, SVHI-10, VFI, VoiSS) or conflicting ratings (VHI-10, V-RQOL), the VHI seemed to be the most promising questionnaire. The VHI showed strong positive evidence for hypothesis testing, moderate positive evidence for three further properties (internal consistency, reliability, structural validity) and an indeterminate rating for content validity. Next best was the VPQ with strong positive evidence on reliability and limited positive evidence on three other properties (internal consistency, structural validity, hypothesis testing). The EASE and VDCQ showed positive evidence on two psychometric properties. For the EASE, strong positive evidence was found for internal consistency and structural validity, and indeterminate ratings for content validity and hypothesis testing. The VDCQ showed limited positive ratings for content validity and structural validity, and had indeterminate ratings for internal consistency and hypothesis testing. Three questionnaires received positive ratings for a single property: the TVQMtF (strong positive rating for reliability and indeterminate rating for internal consistency), the GFI (moderate positive rating for hypothesis testing and indeterminate rating for reliability) and the TSEQ (moderate positive rating for hypothesis testing).

In addition to its psychometric properties, the reasons for selecting a questionnaire may depend on clinical or research purposes. Therefore, population-specific measures may be preferred, such as questionnaires targeting singers (EASE, SVHI or SVHI-10) or male-to-female transsexual women (TSEQ or TVQMtF). The included questionnaires [15] differed in target populations, instrument purposes and measure format, including the number of subscales, items and response options. Despite these differences, all measures were self-report questionnaires on FHS and/or HR-QoL in dysphonia in adult populations and were, therefore, included in this review. The psychometric evidence on the measurement properties of a questionnaire needs to be considered before a final decision can be made as to which questionnaire to select. Incomplete data on psychometric properties of questionnaires do not necessarily imply poor psychometric quality; however, selection of these questionnaires is not currently supported by robust evidence. This lack of psychometric data on many of the questionnaires on FHS and/or HR-QoL is therefore concerning. For example, if no psychometric data are available on content validity, doubt may arise as to whether the content of the questionnaire adequately reflects the construct being evaluated. This would argue against the use of the questionnaire, as content validity is considered one of the most important measurement properties [47]. Likewise, the use of a questionnaire with negative psychometric evidence cannot be justified based on its psychometric properties.

A recent review on the psychometric properties of voice-related patient-reported measures [20] summarised findings based on a newly developed simplified checklist (present or absent) of evaluative criteria to operationalise measurement characteristics [18]. The COSMIN group strongly criticised the methodological shortcomings of this checklist [19], indicating that the psychometric properties of questionnaires in dysphonia remain unclear based on the review using the simplified checklist [18]. Because the inclusion and exclusion criteria of this review differed from those of the review using the simplified checklist [20], only eight of the 15 questionnaires included in the current review were also covered by the previous review [20]: GFI, VFI, VPQ, VDCQ, VHI, VHI-10, V-RQOL and VoiSS. Considering the eight measures that overlap between both reviews, the previous review [20] favours the V-RQOL over the other seven measures in terms of developmental measurement properties and applicability. In contrast, when using the COSMIN taxonomy, the V-RQOL only achieved limited and moderate positive ratings for two psychometric properties, and indeterminate and conflicting ratings for another two properties, respectively. When evaluating the psychometric properties of the same eight measures using the COSMIN taxonomy, the VHI followed by the VPQ were found to be the most promising measures. The previous review [20], however, rated the VHI as the equal fourth best measure, and the VPQ received a much lower rating (the second lowest rating among the eight measures that overlap with the current COSMIN-based review). These findings indicate that the simplified checklist not only has methodological shortcomings as outlined by the COSMIN group [19], but also leads to different results compared with the COSMIN taxonomy. The terminology, interpretation of identified psychometric data and overall quality ratings for measurement properties using the simplified checklist [20] differ substantially from the psychometric data reported in the current review using the COSMIN taxonomy and checklist. Our findings contra-indicate the use of the simplified checklist [18] for evaluating the psychometric properties of measures for clinical and research purposes.

Remarkably, all but one study in this review used classical test theory (CTT). Only one study [30] used the more recently developed item response theory (IRT) to determine psychometric properties. Even though CTT methods and findings are easier to interpret than those of IRT, the CTT framework has some limitations. In contrast to IRT, where the unit of analysis and results are not restricted to the test population, the evaluation of psychometric properties in CTT is specific to the test population. Further, CTT assesses the performance of a measure as a whole, whilst IRT evaluates the reliability of each individual item [48]. The IRT models estimate both item and person parameters within the same model, calculate person-free parameter estimation and item-free trait level estimation, and identify optimal scaling of individual differences based on the evaluation of differential item functioning [49]. Based on the added value of IRT, future studies on the development and validation of measures should consider using IRT instead of CTT.

Limitations

This review has some limitations; only questionnaires validated in English and psychometric studies published in English were included. Therefore, some psychometric findings on FHS and/or HR-QoL questionnaires in dysphonia may have been excluded. Further, not all authors who published on the psychometric properties of the included questionnaires were contacted. Finally, we did not report on all nine psychometric properties within the COSMIN framework; criterion validity was not included because no agreed gold standard in the field of FHS and/or HR-QoL in dysphonia is available, and responsiveness was out of the scope of our current review. As interpretability is not considered a psychometric property within the COSMIN taxonomy, which has been confirmed in more recent literature [50], interpretability was also not reported on.

Conclusions

This systematic review reports on the psychometric properties of 15 self-report questionnaires for the evaluation of FHS and/or HR-QoL in adults with dysphonia. The COSMIN taxonomy and checklist were used to assess the methodological quality of 48 studies reporting on psychometric characteristics of the included questionnaires. Quality criteria were used to rate the psychometric data on measurement properties for each study [22, 23]. An overall quality score per measurement property per questionnaire was determined by applying the criteria or levels of evidence [23]. Only preliminary conclusions can be drawn as many psychometric data proved missing or indeterminate for all questionnaires included. Based on currently available psychometric data from the literature, the VHI seems to be the most promising questionnaire, followed by the VPQ. More research is needed to evaluate the psychometric properties of existing questionnaires that have not been tested to date, and to augment evaluations of questionnaires using both IRT modelling and international consensus-based psychometric quality criteria and terminology, such as the COSMIN framework.