Chronic dysphagia resulting from a wide range of causes [1] contributes to morbidity and mortality [2] and has significant psychosocial effects [3]. Increasing numbers of referrals to dysphagia services compete for limited resources, with healthcare purchasers demanding clear evidence of beneficial outcomes. The assessment procedures used to diagnose and formulate a management plan, and the resultant outcomes, are appropriately coming under increasing scrutiny. The reliability of any technique needs to be determined: a useful dysphagia assessment must be sensitive and reproducible within and between clinicians. There are limited data to support many of the assessment procedures used in dysphagia [4]. The bedside or clinical swallow assessment (CSA) varies in its reported reliability [5] and in its validity when compared with the “gold standard” videofluoroscopic swallow study (VFSS) [6, 7, 8]. This “gold standard” itself has poor intra- and interjudge reliability, and even these studies are limited to between 3 and 10 judges [9, 10, 11]. Novel techniques are often supported enthusiastically but with little evidence base. New interventions should be evaluated carefully, but comparisons with existing techniques must be tempered with the knowledge that prevalence of use is not evidence of superiority.

Cervical auscultation (CA) is increasingly being used to supplement the CSA. The sounds associated with swallowing have been investigated using accelerometers and microphones for their acoustic properties [12] and for prediction of aspiration [13]. There are few robust studies of assessment by CA, and no consensus has been reached on its reliability or validity. Reliability refers to the trustworthiness of an instrument: is it consistent in the answers it gives? Validity asks whether the instrument is measuring what we expect. One of the conditions for validity is that an instrument must be reliable [14]. The most recent work suggests that agreement between judges is poor, with only a few people sufficiently consistent within themselves to be classified as reliable [15]. The technique is part of the battery of tools used in the CSA, but it is possible that judgments of the sounds are unduly influenced by what has been read in the notes or already observed at the bedside. In other words, do raters in fact anticipate acoustic abnormality rather than detect it?

The objective of this study was to determine whether clinicians experienced in CA could identify normal/abnormal swallow sounds from listening alone. The aims were to establish, in a representative sample of judges:

  1. The range of intrarater reliability: Is an individual consistent?

  2. Their interrater reliability: Do colleagues agree?

  3. The overall validity of CA against the “gold standard,” videofluoroscopy: Does CA get it “right”?

  4. The association between intrarater reliability and validity of CA judgment: Do features such as experience or work pattern make an individual more reliable or more right?

Participants and Methods

Controls

Ten healthy volunteers were recruited to act as the control sample (median age = 72 years, range = 24–78 years). Exclusion criteria were previous history of dysphagia or eating/drinking difficulties, neurological impairment, cardiorespiratory disease, current medical conditions requiring medication, or structural abnormalities that could affect the swallowing or respiratory systems.

Dysphagic Stroke Patients

Over a 6-month period, 20 consecutive dysphagic stroke patients (median age = 78 years, range = 65–90 years) who failed the CSA (i.e., showed clinical signs of dysphagia and of being at risk of aspiration) [16] were approached. Exclusion criteria were general medical unfitness (the consultant in charge deemed the patient too ill to participate), neurological conditions other than stroke, methicillin-resistant Staphylococcus aureus (MRSA), as advised by infection control because of the use of noncleanable equipment, previous history of dysphagia or involvement in other studies, presence of a tracheostomy tube because of interference with respiration and swallow sounds, or transfer or discharge before the patient could participate. Of the initial group, 14 were recruited (1 transferred, 1 refused, 2 had a worsening condition, 1 had no next of kin to give assent, 1 had MRSA identified on a late swab). The patients were studied a minimum of 48 hours poststroke. This allowed the physical system to stabilize and aimed to reduce the anxiety that the patient might experience poststroke. See Table 1 for characteristics of the 14 participating patients.

Table 1 Patient characteristics

Swallow Sound Raters

Speech–language pathologists (SLPs) with experience in dysphagia and CA were notified of the study and asked to consider participation. Thirty-one SLPs from regional and national special interest groups and local hospitals agreed to participate in the study. Dysphagia experience ranged from 1 to 13 years (median = 6 years) and CA experience ranged from 1 to 6 years (median = 5 years). Each rater completed a detailed questionnaire (Appendix 1; SLT = SLP). Participants varied in all aspects addressed in the questionnaire (see Figs. 1, 2, 3, 4, 5, 6).

Fig. 1 CA training level (Q3).

Fig. 2 CAs performed per week (Q4).

Fig. 3 CA procedure (Q6).

Fig. 4 How CA is rated (Q7).

Fig. 5 Self-rating of experience level (Q8).

Fig. 6 CA practice pattern: when CA is used and where the findings are recorded (Q9).

Ethical Approval and Consent

Written informed consent or assent was obtained for all participants in the study. The Newcastle and North Tyneside Joint Ethics Committee granted ethical approval for the study.

Equipment

This easily portable and noninvasive system is a development of an earlier one [16]. The sounds were recorded onto the hard drive of a notebook computer (Toshiba, Tokyo, Japan) via a Littmann Cardio III stethoscope (3M, Loughborough, UK), with a BL 1994 microphone (Knowles Acoustics, Burgess Hill, UK) inserted into the tubing at the bifurcation. The recording quality of the system was optimized to match what clinicians actually hear at the bedside. Tube length and recording quality were modified iteratively until two medical physicists (one with perfect musical pitch) and an experienced clinician agreed that the sound was as close as possible to that heard through an identical unmodified stethoscope at the bedside. Three SLPs were blindfolded and asked to listen to live and prerecorded swallow sounds and to comment on the quality. All of the sounds were from one healthy, nondysphagic person swallowing 5 ml of water. There was no discernible difference between the live and prerecorded sounds. Interestingly, all of the SLPs reported hearing swallow sounds that they considered abnormal and assumed to be from stroke patients. Many studies have used accelerometers or even microphones, but the resulting recordings do not sound like those a clinician actually hears.

The stethoscope head is radiopaque, which can cause interference with the image obtained during the VFSS. One option is to use a stethoscope with a radiotranslucent head [15]. We chose the radiopaque Littmann Cardio stethoscope, as used at the bedside, to give realistic sounds and to optimize recording quality for future acoustic analysis [17]. The stethoscope head was positioned on the neck over the lateral aspect of the thyroid cartilage, just off-center [18], and held in place by an elasticated Velcro band. A preliminary screening of the position was performed to check the image.

Test Bolus Materials

All studies of control and patient group participants were performed with simultaneous VFSS in the X-ray department. Three boluses each of 5 ml thin barium, 20 ml thin barium, and 5 ml yogurt were presented. E-Z-Paque Barium Sulphate Ph Eur (E-Z-EM Ltd, Bicester, UK) was used as the contrast material. Patients received fewer boluses if it was deemed clinically inappropriate to continue with the VFSS. The “thin barium” was a standardized free-flowing barium sulfate contrast liquid of 52% weight/volume. This was the most dilute material that could be imaged clearly on VFSS, but it is still not rheologically identical to water. The liquid was stirred frequently to keep the barium in suspension, since it settles out quickly and thus affects the contrast.

The liquids were measured by graduated syringe into a small plastic cup; the participant was asked to drink the entire contents in one swallow in order to mimic real drinking as closely as possible. Injecting materials into the mouth may affect the normal swallow process, whether or not a person is then allowed to swallow at will. Yogurt was measured using an accurate 5-ml medicine spoon. Wherever possible, participants fed themselves. If needed, the SLP supported the cup/spoon or helped to feed while letting the participant guide the process.

The boluses were presented in the same order to all participants. To avoid “learner” or fatigue effects, the boluses would ideally have been presented in a random order; this was not possible because the materials were presented in the standard order for a VFSS. Analysis of data from the development of the system indicated that there were no learner or fatigue effects.

Swallow Sound Compact Disk

Ten sound clips were taken from healthy control group swallows (with no aspiration/penetration) and ten from patient group swallows (with radiologically defined aspiration/penetration), giving a total of 20 examples. All patient swallows showed aspiration or significant penetration (an indication of an “at-risk patient” [19]). Current clinical practice in CA is to classify aspirators and penetrators as “abnormal.”

Swallow clips were first identified in the patient group where clear aspiration/penetration had occurred. These clips were then paired with control swallows with no aspiration/penetration, matched for bolus consistency and volume, gender, and, as closely as possible, the age of the person. This reduced the clips to those from 7 control participants (median age = 73 years, range = 61–78 years) and 7 patient participants (median age = 78 years, range = 65–90 years). Swallow clips that were affected by voicing or coughing were excluded. The microphones were more sensitive than expected, and occasionally the clinician could be heard commenting on the swallow; such examples were also excluded. Where there was a choice of clip, the first recording was chosen. This procedure was carried out to minimize the effect of researcher bias in choosing very different or very similar sound clips.

The sound clips were randomly ordered and recorded onto compact disk (CD). The CDs were sent to the 31 volunteer SLPs with written instructions (Appendix 2), a response form, the questionnaire, and a return post-paid envelope. The SLPs were asked to rate each swallow as “normal” or “abnormal,” to say whether this judgment was probable or definite, and then to give any other qualitative comments. No definitions of “normal” or “abnormal” were given since no standard definitions exist in clinical practice. Responses were received from 19 of the 31 SLPs, of whom 15 said they would be prepared to rerate the sounds for intrarater reliability. The sound clips were rerandomized and recorded onto a second CD. After at least 4 weeks the 15 volunteers were sent the new CD, and responses were received from 11 of them.

Data Collection/Statistical Analysis

Data were analyzed with the SPSS for Windows (Release 11.0, SPSS Inc., Chicago, IL) and Stata (SE 7, Stata Corporation, College Station, TX) packages. Observed agreement results are quoted together with the corresponding kappa values; kappa allows for the effect of chance and bias. Correlation of rater characteristics with reliability and validity was analyzed using Spearman’s coefficient for ranked nonparametric data. For the purposes of statistical analysis of rater reliability it was deemed appropriate to collapse the data into dichotomous variables; hence, the distinction between “probable” and “definite” was removed. Results were accepted as statistically significant at the 5% level.
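The analyses were run in SPSS and Stata; as a minimal, non-authoritative sketch of the two key statistics in open-source tools (Cohen’s kappa for a single rater’s test–retest agreement and Spearman’s rank correlation for associations with rater characteristics), using entirely hypothetical data rather than the study ratings:

```python
# Illustrative sketch only: hypothetical ratings, not the study data.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

# Dichotomous ratings of the same 20 sound clips on CD1 and CD2
# (1 = abnormal, 0 = normal) for one hypothetical rater.
cd1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
cd2 = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# Test-retest (intrarater) reliability, corrected for chance agreement.
kappa = cohen_kappa_score(cd1, cd2)

# Hypothetical per-rater summaries: years of CA experience and
# intrarater kappa, used to test for an association (Spearman's rho).
experience_years = [1, 2, 3, 4, 5, 5, 6, 6, 6, 6, 6]
intrarater_kappa = [-0.12, 0.10, 0.22, 0.27, 0.30, 0.35, 0.38, 0.45, 0.50, 0.60, 0.71]

rho, p_value = spearmanr(experience_years, intrarater_kappa)
print(f"kappa = {kappa:.2f}, rho = {rho:.2f}, p = {p_value:.3f}")
```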

Results

Individual Test–Retest Reliability

Are we able to make a consistent judgment on the same sound from one hearing to another?

Figure 7 shows the 11 rerater responses:

  • The observed self-agreement, i.e., the number of times an individual’s ratings agreed when comparing the same sound on CD1 with CD2

  • Whether this self-agreement was on “normal” or “abnormal” sounds

  • The derived kappa statistic

  • The number of swallows the rater “correctly” identified (as defined radiologically)

Clinicians varied in their self-agreement from 9/20 to 17/20. That is, some individuals could rate the same sounds no better than chance (10/20) but some were much more consistent. To allow for chance and personal rating bias, the kappa statistic was calculated.

$$\kappa = \frac{\text{observed agreement} - \text{agreement by chance}}{\text{perfect agreement} - \text{agreement by chance}}, \qquad \text{where perfect agreement} = 1$$

Fig. 7 The first and second ratings of the same 20 swallows by 11 SLPs. Circles indicate swallows that were rated the same on both occasions. Whether a rating agreed with VFSS can be judged from the background shading; open circles should lie on a white background and vice versa.

Chance agreement is 0.5 for an unbiased observer; it becomes larger when an observer has an overall bias toward one or the other rating. For example, rater 2 self-agreed on 13 of 19 swallows and, with 22 normal and 16 abnormal ratings overall, had a kappa of 0.38. Rater 17 also self-agreed on 13 of 19 swallows but was rather more biased (12 normal, 26 abnormal) and had a correspondingly lower kappa of 0.27 (a worked reconstruction is given below).
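As a worked illustration (with the assumption, ours rather than the study’s, that chance agreement is estimated from rater 17’s pooled marginals across the two CDs, i.e., roughly equal splits on each disk; the exact computation used in the study may differ), rater 17’s kappa can be reconstructed as

$$p_{\text{chance}} = \left(\frac{26}{38}\right)^{2} + \left(\frac{12}{38}\right)^{2} \approx 0.57, \qquad \kappa = \frac{13/19 - 0.57}{1 - 0.57} \approx 0.27$$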

Intrarater individual kappas ranged from −0.12 to 0.71 with a mean of 0.35. Of these 11 SLPs, 7 judges rated “fair” or better according to the Landis and Koch guidelines [20]:

kappa < 0.20: poor agreement

0.21–0.40: fair

0.41–0.60: moderate

0.61–0.80: good

0.81–1.00: very good

(kappa = 0 corresponds to chance agreement; kappa < 0, worse than chance)

Predictors of Intrarater Reliability

In total, 148 swallows were classified the same on both ratings (Fig. 7). Of these, 74 (50%) were classified as normal. This implies that, across the group as a whole, self-agreement was not linked to whether a rater was classifying a swallow as normal or abnormal.

Of the 148 self-agreed swallows, 102 agreed with VFSS, i.e., were “correct.” Of these, 50 (49%) were classified as normal, i.e., self-agreement was not influenced by whether the rating was VFSS “correct.” There was no correlation between an individual’s reliability (kappa) and the “correctness” of his/her self-agreement.

There was no correlation between an individual’s reliability and his/her behavior, practice pattern, experience, or self-rated expertise level.

Group Agreement

Do we hear what our colleagues hear?

The interrater reliability of the 19 clinicians (based only on the first rating for the reraters) gave a kappa of 0.17. For radiologically normal swallows in control participants (no aspiration/penetration on VFSS), kappa = 0.02; for radiologically abnormal swallows in poststroke participants (aspiration/penetration on VFSS), kappa = 0.18. The lower kappa value could be interpreted as raters being even more unreliable when rating normal swallows, but agreement in both conditions was “poor.” Indeed, for 10 of the 20 swallows (Nos. 3, 4, 5, 7, 9, 10, 12, 17, 18, 20), there were raters who classified the swallow as “definitely normal” and raters who classified it as “definitely abnormal.”

The interrater kappa of the 11 reraters (first reading) was 0.13, indicating that they were representative of the original 19 judges. We know from the intrarater results that there is a range of individual reliability, i.e., individuals will be affecting the overall reliability of the group. Similarly, some sound clips were more consistently rated than others, e.g., sound No. 1 was overwhelmingly scored as normal when aspiration/penetration did occur. This may have been clinically silent.
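The text does not specify which multirater kappa statistic was used for these group figures; as a hedged sketch only, a Fleiss-type kappa over a clips-by-raters matrix could be computed as follows (the ratings here are random placeholders, not the study data):

```python
# Illustrative sketch only: random placeholder ratings, not the study data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# 20 sound clips (rows) rated by 19 raters (columns); 0 = normal, 1 = abnormal.
ratings = rng.integers(0, 2, size=(20, 19))

# Convert the clips-by-raters matrix into per-clip category counts,
# then compute the multirater (Fleiss) kappa.
table, _ = aggregate_raters(ratings)
print(f"interrater kappa = {fleiss_kappa(table):.2f}")
```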

Figure 8 shows the total ratings for normal or abnormal for each of the 20 sounds rated by the whole group of 19 raters. How do we decide where the cutoff is for a significant vote one way or the other? Using the Landis and Koch guidelines:

Fig. 8 The first ratings of 20 swallows by 19 SLPs. Whether a rating agreed with VFSS can be judged from the background shading; open circles should lie on a white background and vice versa.

For “fair” agreement, kappa ≥ 0.21, so a vote of more than 11.5, i.e., at least 12 raters, in one direction would show “fair” agreement between the raters. Similarly, for “moderate” agreement 14 raters and for “good” agreement 16 raters must vote in one direction (the arithmetic is sketched below).
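These cutoffs can be reconstructed by applying the kappa formula to the vote count v, assuming (our assumption) that chance corresponds to an unbiased 50:50 split of the 19 votes, i.e., 9.5 votes:

$$\kappa = \frac{v - 9.5}{19 - 9.5} \;\Rightarrow\; v = 9.5 + 9.5\,\kappa: \quad \kappa \geq 0.21 \Rightarrow v > 11.5\ (12); \quad \kappa \geq 0.41 \Rightarrow v \geq 13.4\ (14); \quad \kappa \geq 0.61 \Rightarrow v \geq 15.3\ (16)$$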

  • 16 sounds were rated with fair agreement, 8 of which were “abnormal” on VFSS

  • 11 of these 16 sounds were rated with moderate agreement, 6 of which were “abnormal” on VFSS

  • 3 of the 11 sounds were rated with good agreement, all 3 of which were “abnormal” on VFSS.

A major concern for the researchers was that raters would guess that there were 10 normal and 10 abnormal sounds. Abnormal ratings of exactly 50% were obtained from only four raters in the interrater group. Abnormal ratings ranged from 6 to 13 of the 20 sound clips.

Validity

Do our judgments agree with VFSS?

Using VFSS as the “gold standard,” 125/190 normal ratings matched no aspiration/penetration, i.e., a specificity of 66%. Similarly, 117/190 abnormal ratings matched aspiration/penetration, i.e., a sensitivity of 62%. Ratings can be used to predict the results of VFSS (via positive and negative predictive values); however, as this study is concerned with assessing raters’ reliability and their ability to match the gold standard, sensitivity and specificity values have been calculated. It is inappropriate to calculate predictive values where the prevalence has been artificially controlled.
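Assuming (our reading) that each denominator of 190 corresponds to the 19 raters each judging the 10 radiologically normal or the 10 radiologically abnormal clips, the quoted figures follow directly:

$$\text{specificity} = \frac{125}{19 \times 10} \approx 0.66, \qquad \text{sensitivity} = \frac{117}{19 \times 10} \approx 0.62$$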

Sensitivity and specificity are based on individuals’ ratings. If we look at majority consensus (Fig. 8, bottom row) the group correctly identified 9/10 normal and 8/10 abnormal swallows (90% specificity and 80% sensitivity).

There was a significant relationship between an individual’s reliability and true positive rate (rs = 0.623, p = 0.040).

Discussion

This is the largest study to date of rater reliability in CA, using 19 initial raters, 11 reraters, and 20 swallow sounds. There is a dearth of robust studies in the field of CA. Studies involving reasonable numbers of swallow sounds have often had few raters, lacked a healthy asymptomatic control group, or lacked simultaneous VFSS as a gold-standard comparator [13, 15]. This is one of the few studies to have recorded swallow sounds that match what a clinician really hears at the bedside. This study isolated the sounds to reduce the influence of prior knowledge.

Reliability of CA

The data clearly demonstrate that few people are reliable in CA. This unreliability in an individual’s judgments will affect the usefulness of acoustic characterization of swallow sounds as reported by Cichero and Murdoch [12]. The clinical applicability of such detailed analysis will be hampered until we can improve intrarater reliability.

Demographic data analysis revealed that the reliability of a clinician is independent of factors historically presumed to improve skills, such as years of experience (Figs. 1–6). The definitions of aspects such as practice and experience were left deliberately vague. We do not know whether any person can be trained to improve his/her listening skills, or whether some people’s background or innate ability predisposes them to be more reliable auscultators. For example, is someone with a “musical ear” at an advantage when listening to and characterizing sounds of any type? Clinicians using this technique do not have their hearing tested routinely: this could be a significant factor affecting an individual’s performance.

More detail is required before we could draw definite conclusions on, say, how much and what type of training a clinician should have, and what continuing peer review is needed. Training and discussion have been shown to improve the reliability of VFSS judgments [21]. A system such as the one we describe here will be invaluable in future studies to address the manifestly unstandardized current approaches to CA. We would then be in a position to quantify the “added value” of this technique to the CSA, for example, when investigating the effect of previous knowledge of a patient’s history.

The agreement among the group of raters was poor. This is in line with findings for established techniques such as VFSS and laryngoscopic swallow studies.

Validity

The overall accuracy of the technique in identifying aspiration/penetration was limited because of individual variation in reliability. The validity of a technique depends upon its reliability, and this is borne out by the results of this study. Since the group consensus correctly classified 17 of the 20 swallows, we may speculate that the swallow sound contains audible cues that should in principle permit reliable classification. If we could improve the poor raters, we would improve the overall accuracy of the technique in detecting abnormality in swallowing. The question of whether auscultation improves the accuracy of the CSA has yet to be answered, but these issues will bear on it.

The Future

What is the physiology behind swallow sounds? Future work should expand the limited evidence base of synchronized sounds and images from both VFSS and laryngoscopy. Analysis of simultaneous sound and image data will contribute to the continuing debate: What can cervical auscultation detect and what does it contribute to the clinical swallow assessment?