Introduction

Flexible endoscopic evaluation of swallowing (FEES) is frequently used in clinical and research practice to visualize the pharynx, larynx, and subglottis before, during, and after swallowing [1, 2]. One reason clinicians and researchers use FEES to assess swallowing is to obtain detailed information related to functional swallowing outcomes, including the frequency and severity of impaired swallowing efficiency (pharyngeal residue) and swallowing safety (penetration-aspiration) [2, 3]. This is because current evidence suggests that endoscopic swallowing evaluations possess a sensitivity for assessing pharyngeal residue, penetration, and aspiration that is either comparable to or greater than that of videofluoroscopic swallow studies [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23].

Ninety to 100% of speech-language pathologists ‘usually’ or ‘always’ assess pharyngeal residue, penetration, and aspiration when interpreting instrumental swallowing assessments [24,25,26]. Despite this, there is a high degree of variability regarding how clinicians interpret FEES [13, 27]. In a 2016 survey by Pisegna and Langmore, clinician respondents indicated that the perceived challenges associated with interpreting FEES included: (1) identifying anatomic structures; (2) identifying penetration and aspiration; and (3) knowing how to rate residue. In fact, eight methods to rate residue were reported. The largest share of respondents (20.3%) reported rating residue based on the amount of residue filling or covering an anatomic structure (an ‘anatomically defined residue estimation’ method), with a slightly lower number of respondents (18.8%) reporting that they rated residue based on the amount of bolus remaining in the pharynx relative to the estimated amount originally swallowed (a ‘bolus clearance estimation’ method) [13]. Nearly 17% of the respondents provided vague descriptions of how residue was rated. This ambiguity and variability in FEES rating methodologies can negatively affect reliability among examiners [28] and makes it difficult to validly compare FEES results across studies. Therefore, there is a significant need to standardize methods for FEES analysis.

Several scales have been developed for, or adapted to, FEES to address these standardization needs [27]. While all of these scales use a categorical rating system, they vary considerably in terms of the outcomes measured, the number of categories used, and the definitions associated with each severity category (Table 1). Additionally, categorical rating methods for FEES [28,29,30,31,32,33,34,35,36,37,38] may not be as reliable or as sensitive as visual analogue scales [39,40,41] due to their limited ability to document relatively small unit differences [42,43,44]. In fact, emerging evidence suggests that a visual analogue scale may be a more valid method for rating residue when compared to categorical rating methods [39,40,41]. Furthermore, ambiguities and inconsistencies remain in how anatomic and temporal boundaries are defined within and across these scales, which may limit the reliability and generalizability of findings across studies and clinical practices.

Table 1 Comparisons of FEES Scales

Given the above, the primary aim of this study was to describe the development of the Visual Analysis of Swallowing Efficiency and Safety (VASES)—a standardized approach for rating pharyngeal residue, penetration, and aspiration during FEES. Because standardization of FEES analysis relies on developing a rating method that is feasible to train and implement [45], the secondary aim of this study was to explore the feasibility of training and implementing VASES in a novice group of FEES raters.

Methods

VASES Development

This study was approved by the Teachers College, Columbia University’s Institutional Review Board (IRB #: 21-071). The development of VASES included a consensus panel of six ASHA-certified speech-language pathologists who were independent in the performance and interpretation of FEES, and who had previously published research on topics related to the performance and interpretation of FEES. Three of the panel members were PhD level clinicians with more than 5 years of experience treating dysphagia, and three were master’s level clinicians with 1–5 years of experience treating dysphagia. All six members obtained dysphagia education and clinical training at differing locations nationally and internationally, including California, Florida, Massachusetts, New York, and New Zealand. VASES development occurred within the context of seven open format discussion meetings. The consensus panel aimed to develop detailed rules and operational definitions for the VASES rating methodology based on: (1) the challenges previously reported for FEES interpretation [13]; (2) the observation that most clinicians use FEES to rate residue, penetration, and aspiration [24,25,26]; and (3) the presence of inconsistencies and ambiguities within and across scales related to rating scale method (type and description of severity rating levels) and the temporal and anatomic boundaries within which to rate residue, penetration, and aspiration [29,30,31, 33, 34, 46].

Once these initial rules and operational definitions were established, four of the six panel members re-convened to blindly rate 55 de-identified FEES video clips as a consensus panel using the VASES rules and operational definitions. Each video clip contained a single bolus trial. VASES outcomes were rated within the context of an open format group discussion using one computer monitor. In a rotating fashion, one of the four panel members would provide initial VASES ratings for all seven outcome measures. Then, the remaining three members of the consensus panel would indicate if they approved the outcome measure ratings or requested a revision to any of the outcome measure ratings. A majority of the panel members (three to four) was required to approve the final rating in order for it to be used as a subsequent criterion reference. The process was repeated for each video clip, with a new member of the consensus panel leading the initial rating. Consensus panel ratings served two purposes. The first purpose was for the consensus panel members to pilot the VASES rating method and to identify areas requiring further refinement. The second purpose was to create criterion references for training purposes.

The FEES video clips were pulled from an outpatient clinical research database of people with dysphagia and neurodegenerative disease. The FEES equipment used in these video clips was a 3.0 mm diameter flexible distal chip laryngoscope (ENT-5000; Cogentix Medical, New York, USA) and video system with integrated LED light source and LCD display (Cogentix Medical, DPU-7000A). During the FEES, the flexible laryngoscope was passed transnasally, without the use of topical anesthetic or vasoconstrictors. The tip of the endoscope was positioned within the oropharynx to visualize the pharynx, larynx, and subglottis before, during, and after all swallows. As needed, the endoscope was advanced throughout the pharynx and laryngeal vestibule after each swallow to more closely inspect residue patterns throughout the pharyngeal, laryngeal, and subglottic spaces. FEES were completed by, or under the direct supervision of, a speech-language pathologist experienced in the performance and interpretation of FEES. Video clips included thin liquid (IDDSI 0) and puree (IDDSI 4) bolus trials. Thin liquid boluses were impregnated with contrast material to enhance visualization, and included either blue dyed water, green dyed water, white dyed water, or thin liquid barium [47, 48]. Videos included a convenience sample of previously recorded FEES intended to represent a range of rating difficulty as it relates to identifying anatomic and temporal boundaries for FEES interpretation.

Feasibility of VASES Training and Implementation

VASES Training Protocol

Twenty-six novice raters were recruited from a graduate school speech-language pathology department. All raters were master’s level students who had one semester of dysphagia coursework with at least one lecture on FEES. The coursework and FEES lecture were from the same instructor for all novice raters. All raters were in the third semester of their training program at the time of the study and had completed one semester of internship training within the university clinic. The internship training did not involve interpretation of FEES or VFSS. Raters were instructed to complete three study phases: (1) Pre-Training Assessment, (2) VASES Training, and (3) Post-Training Assessment. All three phases were completed by the raters using their personal laptop computers in a quiet, private room within their household. Videos used in all three study phases were drawn from the clips with criterion ratings created by the consensus panel during the VASES development portion of this study.

During the Pre-Training Assessment, novice raters were presented with 25 FEES video clips, with 10 video clips repeated for intra-rater reliability analysis, for a total of 35 video clips. For each video clip, novice raters were instructed to: (1) watch the entire video clip in real-time and then use slow motion frame-by-frame viewing as needed; (2) refer to the anatomic boundaries reference picture (Fig. 1); (3) rate the greatest amount of residue seen on anatomic structures from any new bolus material, up until the end of the swallow; and (4) rate the highest PAS score that occurred throughout the entire video clip. Each novice rater was provided with a copy of the PAS (Table 2) [46]. All novice raters were instructed to complete the ratings within a single sitting.

Fig. 1

Picture of the anatomic landmarks provided during pre- and post-training

Table 2 Penetration Aspiration Scale (PAS)

The VASES Training Phase included five parts completed in sequential order at their own pace over 1 week. Part 1 involved viewing a PowerPoint presentation that outlined the VASES rules and operational definitions (listed below in the Results section). Part 2 involved practicing VASES by viewing and rating five FEES practice video clips (not included in the pre/post assessments). Part 3 involved watching a pre-recorded, 60-min didactic training session between one of the consensus members and a novice rater not involved in this study. Part 4 involved additional VASES practice by viewing and rating another set of five practice video clips. Part 5 involved attending a live, 60-min, 10-person group training session during which the novice raters engaged in a question-and-answer group discussion with one of the consensus panel members to discuss questions related to the VASES rules.

The Post-Training Assessment involved rating the same set of 35 video clips (re-randomized) that were rated Pre-VASES training. The Post-Training Assessment was completed in one sitting 1 week after VASES Training. Following completion of study participation, the novice raters reported the number of hours required to complete the Pre-VASES Training phase, the VASES Training phase, and the Post-VASES Training phase.

Statistical Analyses

Aim 1: VASES Development Frequency distributions were calculated and used to characterize the VASES criterion ratings of the bolus trials measured by the consensus panel for the pre- and post-training assessments.

Aim 2: Feasibility of VASES Training and Implementation Changes in the accuracy of rating FEES using VASES from pre- to post-training were used as the primary method for examining the feasibility of VASES training. Accuracy was measured by examining the absolute difference in VASES scores between the novice raters’ ratings and the consensus panel criterion ratings. The average absolute difference was calculated for each outcome measure. VASES training was considered to be feasible if ≥ 50% of the seven VASES outcome measures demonstrated a post-training increase in VASES accuracy. A Wilcoxon signed-rank test was initially used to examine differences in the accuracy of PAS ratings pre- vs. post-training. If data were not normally distributed, then a related-samples sign test was used instead. Paired sample t tests were used to examine differences in residue rating scores. Outliers were defined as values more than 1.5 times the interquartile range above the 75th percentile or below the 25th percentile. If outliers in score differences between the pre- and post-training were detected, or score differences were not normally distributed, then Wilcoxon signed-rank tests were used. Lastly, if data were not symmetric, then a related-samples sign test was used instead of a Wilcoxon signed-rank test. The level of statistical significance was set to a familywise p < 0.05. A Holm-Bonferroni adjustment was used to correct for multiple comparisons (i.e., seven comparisons, one for each outcome measure). Cohen’s d was used to measure the effect size of training on VASES accuracy for each outcome measure. Effect sizes were interpreted as “small” if 0.2 ≤ d < 0.5, “medium” if 0.5 ≤ d < 0.8, and “large” if d ≥ 0.8 [49].
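The accuracy measure, the Holm-Bonferroni adjustment, and the effect size bands described above can be sketched in a few lines of Python. This is an illustrative sketch only and not the analysis code used in this study: all function names are hypothetical, and Cohen's d is computed here as the mean of the paired differences divided by their standard deviation, which is one common paired-samples variant and an assumption about the formula applied.

```python
from statistics import mean, stdev


def mean_absolute_difference(novice, criterion):
    """Average absolute difference between novice and criterion ratings."""
    return mean(abs(n - c) for n, c in zip(novice, criterion))


def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure; returns True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # all remaining (larger) p-values are also retained
    return rejected


def cohens_d_paired(pre, post):
    """Assumed variant: mean paired difference / SD of paired differences."""
    diffs = [a - b for a, b in zip(pre, post)]
    return mean(diffs) / stdev(diffs)


def effect_size_label(d):
    """Bands from the text; values under 0.2 are not labelled there."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "below small"
```

With seven outcome measures, `holm_bonferroni` would receive the seven unadjusted p-values and compare the smallest against 0.05/7, the next against 0.05/6, and so on.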

Intra- and inter-rater reliability, training completion rate, and the average time to complete VASES training were used as secondary measures to assess the feasibility of VASES training. Forty percent of the videos were randomly selected and repeated for analysis by each novice rater during the pre- and post-training assessments to analyze intra-rater reliability. Inter-rater reliability was calculated for each unique pair of novice raters for each outcome measure. These were then averaged across all pairs of novice raters to characterize average inter-rater reliability for each outcome measure pre- and post-training. Two-way, random effects, intraclass correlation coefficients (ICCs) using absolute agreement were used to examine inter-rater reliability for each of the six residue rating outcome measures. Interpretation of ICC was judged to be ‘excellent’ if ≥ 0.90, ‘good’ if between 0.75 and 0.90, ‘moderate’ if between 0.50 and 0.75, and ‘poor’ if < 0.50 [50]. Weighted Cohen’s Kappa with linear weighting (κW) was used to examine intra- and inter-rater reliability for PAS. Interpretation for the κW was judged to be ‘excellent’ if ≥ 0.81, ‘good’ if between 0.61 and 0.80, ‘moderate’ if between 0.41 and 0.60, ‘fair’ if between 0.21 and 0.40, and ‘poor’ if < 0.20 [50]. Residue reliability ratings that were “good” (ICC ≥ 0.61) [39, 51], and PAS reliability ratings that were “moderate” (κW ≥ 0.41) [14, 36, 38, 52,53,54], were used as cut-offs for training feasibility.
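For readers less familiar with linearly weighted kappa, the following Python sketch shows how κW can be computed for one pair of raters and how the interpretation bands above could be applied. It is a minimal illustration under stated assumptions, not the software used in this study: the function names are hypothetical, and values that fall between the stated bands (for example, 0.805) are assigned to the lower band here.

```python
def linear_weighted_kappa(ratings_a, ratings_b, categories):
    """Cohen's kappa with linear disagreement weights for two raters."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)
    # Observed joint proportions for each (rater A, rater B) category pair.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)      # linear disagreement weight
            num += w * observed[i][j]     # weighted observed disagreement
            den += w * row[i] * col[j]    # weighted chance disagreement
    if den == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - num / den


def icc_label(icc):
    """ICC interpretation bands as listed in the text."""
    if icc >= 0.90:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"


def kappa_label(kw):
    """Weighted kappa bands as listed in the text (gaps go to the lower band)."""
    if kw >= 0.81:
        return "excellent"
    if kw >= 0.61:
        return "good"
    if kw >= 0.41:
        return "moderate"
    if kw >= 0.21:
        return "fair"
    return "poor"
```

For PAS, `categories` would be `list(range(1, 9))`, and the pairwise values would then be averaged across all unique rater pairs as described above.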

Training completion rate was measured by comparing the number of raters who completed the training relative to the number of raters who started the training. A training completion rate of ≥ 90% was selected as a cut-off criterion for training feasibility. The average time to complete VASES training was estimated using self-report from each novice rater. Using the 20–25 h typically required to complete MBSImP training as a referent benchmark [55], an average of 25 h or less to complete VASES training was selected as a criterion for training feasibility.

The time to rate each video clip with VASES was used to measure the feasibility of VASES implementation. This was calculated by dividing the total duration needed to complete each pre- and post-training assessment by the total number of video clips in each pre- and post-training phase (i.e., 35 video clips each). The time needed to complete each assessment was tracked and self-reported by each novice rater. Spending an average of 5 min or less to view and rate a single video clip using VASES was selected as the cut-off criterion for what was considered to be feasible for clinical use and implementation.
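The three time and completion criteria described above reduce to simple arithmetic, sketched below in Python. The function names and any numbers passed to them are hypothetical and serve only to illustrate the calculation; they are not data from this study.

```python
from statistics import mean

N_CLIPS_PER_ASSESSMENT = 35  # 25 unique clips plus 10 repeats, per the text


def minutes_per_clip(total_minutes, n_clips=N_CLIPS_PER_ASSESSMENT):
    """Average rating time per clip for one rater's assessment."""
    return total_minutes / n_clips


def completion_rate(n_completed, n_started):
    """Percentage of raters who finished the training they started."""
    return 100.0 * n_completed / n_started


def feasibility_checks(n_started, n_completed, training_hours, assessment_minutes):
    """Apply the three cut-offs: >= 90% completion, <= 25 h, <= 5 min per clip."""
    avg_clip = mean(minutes_per_clip(t) for t in assessment_minutes)
    return {
        "completion_rate_ok": completion_rate(n_completed, n_started) >= 90.0,
        "training_time_ok": mean(training_hours) <= 25.0,
        "time_per_clip_ok": avg_clip <= 5.0,
    }
```

Here `training_hours` and `assessment_minutes` are lists of self-reported values, one entry per rater.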

Results

Aim 1: VASES Development

Rules and operational definitions were created following seven open discussion consensus panel meetings. These rules and operational definitions included four primary areas of analysis, including the what, where, how, and when of VASES ratings, as well as additional secondary rules for FEES rating.

‘What’ to Rate

The first broad area of standardization that emerged from the consensus panel discussions addressed what to include in VASES ratings. The PAS was integrated as one of the outcome measures for VASES since it is commonly used in both clinical and research practice to represent impairments in swallowing safety [56, 57]. Residue ratings for six anatomic landmarks were also integrated as outcome measures for VASES. These anatomic landmarks included the oropharynx, hypopharynx, laryngeal surface of the epiglottis, laryngeal vestibule, vocal folds, and subglottis. These structures are frequently seen during FEES [13] and are included in other commonly used FEES rating scales [29,30,31, 33, 34, 38]. Residue ratings in/on these anatomic landmarks also represent unique impairments in swallowing efficiency (pharyngeal residue) and swallowing safety (penetration and aspiration). Of note, the oropharynx and hypopharynx were used as anatomic boundaries rather than the more typical divisions of the valleculae and piriforms to capture all potential pharyngeal residue not otherwise contained within the valleculae and piriform anatomic landmarks (e.g., on the base of tongue or posterior pharyngeal wall).

‘Where’ to Rate

The second area of standardization that emerged from the consensus panel discussions involved establishing where to delineate exact anatomic landmark boundaries. It was determined that the development of clearly defined anatomic boundaries was necessary for distinguishing oropharyngeal residue from hypopharyngeal residue and for distinguishing between various depths of penetration and aspiration. While many currently available scales provided general descriptions of which anatomic landmarks should be rated for analysis, none of these scales described in detail how to systematically delineate one anatomic landmark from another. Therefore, methods to delineate anatomic boundaries were identified, discussed, and agreed upon by the consensus panel.

Oropharynx-Hypopharynx Anatomic Boundary For the purposes of VASES measurement, the anatomic boundary between the oropharynx and the hypopharynx was first established by identifying the points where the left and right aryepiglottic folds each merge into the laryngeal surface of the epiglottis (i.e., where the medial edge of the aryepiglottic fold is no longer visible). Then, an imaginary horizontal line was drawn across the entirety of the screen connecting these two points (Fig. 2). The space anterior–superior to this line, but not within the laryngeal vestibule, was considered to be the oropharynx, while the space posterior–inferior to this line, but not within the laryngeal vestibule, was considered to be the hypopharynx. An estimate of this imaginary line was extended along the lateral and posterior pharyngeal walls to demarcate the oro/hypopharyngeal boundary along the pharyngeal wall. The valleculae were defined as the three-dimensional space extending from the tip of the epiglottis (where the lingual and laryngeal surfaces meet) horizontally across to the base of tongue at a perpendicular angle. The piriforms (including the lateral channels) were defined as the three-dimensional space extending from the superior-medial border of the aryepiglottic folds and arytenoids horizontally across to the lateral and posterior pharyngeal walls at a perpendicular angle.

Fig. 2

Anatomic boundary for the oropharynx and hypopharynx

Laryngeal Surface of the Epiglottis Anatomic Boundary The anatomic boundary between the laryngeal surface of the epiglottis and the laryngeal vestibule was first established by identifying the same two points where the left and right aryepiglottic folds blend into the laryngeal surface of the epiglottis described above. Then, the middle of the trough between the epiglottic petiole and the free end of the epiglottis was identified. Lastly, an imaginary curved line was drawn connecting the points and the trough (Fig. 3). The area anterior–superior to this boundary was considered to be the laryngeal surface of the epiglottis, while the area inferior to this boundary (to the point of the vocal folds) was considered to be the laryngeal vestibule.

Fig. 3

Anatomic boundary for the laryngeal surface of the epiglottis

Laryngeal Vestibule Anatomic Boundary The laryngeal vestibule was bounded anteriorly and superiorly by the laryngeal surface of the epiglottis (described above), and laterally and posteriorly by the medial surface of the aryepiglottic folds, arytenoids, and inter-arytenoid tissue (Fig. 4) [52, 58, 59]. Given the ambiguity and inconsistency in rating penetration [2, 46, 60,61,62,63,64,65,66], and to maintain generalizability with VFSS, we defined penetration as bolus entering the laryngeal vestibule (PAS 2–3), whereas bolus only on the laryngeal surface of the epiglottis but not within the laryngeal vestibule was not considered penetration (PAS 1).

Fig. 4

Anatomic boundary for the laryngeal vestibule

Vocal Fold Anatomic Boundary The anatomic boundary for the vocal folds included the laryngeal ventricles superiorly, the lateral-most border of the vocal folds, the cartilaginous portion of the vocal folds posteriorly, and the inferior-most border (i.e., lower lip) of the vocal folds inferiorly. The cartilaginous portion of the vocal folds was determined by extending a line from the superior surface of the vocal fold processes posterior around to the inter-arytenoid tissue (Fig. 5). Bolus contained within the boundaries of the vocal folds was considered to be penetration to the level of the vocal folds (PAS 4–5) [46]. Aspiration was considered to be present only when bolus crossed the inferior-most border of the vocal folds and into the subglottic space.

Fig. 5

Anatomic boundary for the vocal folds including the laryngeal ventricles, superior surface of the vocal folds, and medial edge/lower lip of the vocal folds

Subglottis Anatomic Boundary The anatomic boundary for the subglottis included the subglottic shelf, cricoid ring, and trachea (Fig. 6). The subglottic shelf was bounded superiorly and laterally by the inferior-most border of the vocal folds and inferiorly by the superior-most aspect of the cricoid ring. The subglottic shelf extended from the distal (inferior-most) point of the vocal folds down to the proximal (superior-most) border of the cricoid ring. The trachea extended from the distal (inferior-most) border of the cricoid ring down to the carina (if visualized).

Fig. 6

Anatomic boundary for the subglottis including the subglottic shelf, cricoid ring, and trachea

‘How’ to Rate

The third broad area of standardization that emerged from the consensus panel discussions included how to complete VASES ratings. Initially, the use of ordinal scales was proposed as a method to rate residue. However, the literature review and open panel discussion revealed three potential drawbacks related to the use of ordinal scales for VASES residue ratings. The first drawback was the lack of an agreed upon number of severity levels. For example, the Pooling score (P-score) contains only three severity levels [33], the Boston Residue and Clearance Scale (BRACS) [31] and the Dynamic Imaging Grade of Swallowing Toxicity for FEES (DIGEST-FEES) [30, 38] contain four severity levels, the Yale Pharyngeal Residue Severity Rating Scale (YPRSRS) contains five severity levels [29], and the Mansoura Fiberoptic Endoscopic Evaluation of Swallowing Residue Rating Scale (MFRRS) [34] contains seven severity levels. The second drawback was the lack of agreement regarding definitions for residue amount at each severity level. For example, minimal residue for the YPRSRS is equal to 5–25% filled, whereas minimal residue is < 50% filled for the P-score, < 33% filled for the BRACS, and < 10% remaining for the DIGEST-FEES. The third drawback was the emerging evidence suggesting that residue severity levels differ across bolus consistencies [41], which would necessitate the development of different ordinal rating scales to accommodate ratings of different bolus consistencies.

100-Point Visual Analogue Scale To address the aforementioned drawbacks, the consensus panel adopted a 100-point visual analogue scale to estimate the amount of residue filling or covering each anatomic structure, rather than a general impression of how “severe” the residue was perceived to be. The visual analogue scale was used to provide individual residue ratings for each of the six anatomic landmarks. The left-most point of the scale indicated that 0% of the anatomic landmark of interest was filled/covered with residue, whereas the right-most point of the scale indicated that 100% of an anatomic landmark was filled/covered with residue (Fig. 7). A continuous 100-point scale has the unique advantage of being able to be integrated into pre-existing ordinal rating scales. Research by Pisegna et al. has also identified that visual analogue residue scales yield higher levels of sensitivity and reliability among clinicians when compared to ordinal residue rating scales [39,40,41]. Lastly, 100-point visual analogue scales have interval statistical properties, which provide greater statistical power when assumptions are met and can be flexibly adapted to non-parametric alternatives in situations where assumptions are not satisfied.
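To illustrate how a continuous rating could be folded into an ordinal scale, the Python sketch below bins a 0–100 rating using a list of cut points. The function name and the cut points in the usage note are hypothetical placeholders chosen only for illustration; they are not taken from any published residue scale, and a real mapping would need the cut points defined by the target scale.

```python
from bisect import bisect_left


def vas_to_ordinal(vas_percent, cut_points):
    """Map a 0-100 visual analogue rating to an ordinal level index.

    cut_points is an ascending list of inclusive upper bounds; a rating
    above the last cut point falls into the highest level.
    """
    if not 0 <= vas_percent <= 100:
        raise ValueError("visual analogue ratings must lie between 0 and 100")
    return bisect_left(cut_points, vas_percent)
```

For example, with the illustrative cut points `[0, 25, 50, 75]`, a rating of 0 maps to level 0, a rating of 25 to level 1, a rating of 26 to level 2, and a rating of 100 to level 4. The reverse mapping is not possible, which is one reason to record the continuous value first.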

Fig. 7

Example of the visual analogue scale range from 0 (none) to 100% (complete). The central black point (set currently to 50/100) can be moved along the scale

Anatomically Defined Residue Estimations The consensus panel chose to adopt an anatomically defined residue estimation rating method, as opposed to a bolus clearance estimation rating method, given that the majority of dysphagia clinicians [13] and FEES residue rating scales [29, 31, 33] currently use this rating method (Table 3). Residue ratings for the oropharynx and hypopharynx involved first estimating the amount of residue present anywhere within the oropharyngeal and hypopharyngeal spaces. Then the amount of this residue was expressed as an estimated percentage of how full the valleculae and/or piriforms would be if all of the residue was collected into each of these spaces. Coating of residue on the oropharyngeal or hypopharyngeal mucosa without visible pooling resulted in a residue rating of ≤ 3%. Residue ratings for the laryngeal surface of the epiglottis, laryngeal vestibule, and vocal folds involved estimating the amount of mucosa covered with residue and then expressing that as a percentage of the total surface area (visualized and non-visualized) for each anatomic landmark (Fig. 8). Residue ratings for the subglottis were made by estimating the total amount of residue seen covering the subglottic shelf, cricoid ring, and trachea, and then expressing that as a percentage of the subglottic shelf. Only residue that was directly observed, but never inferred, was included in a rating.

Table 3 Residue rating method

Fig. 8

Illustrated examples of blue residue covering the laryngeal surface of the epiglottis, with residue covering 0% (top left), 3% (top right), 45% (bottom left), and 76% (bottom right)

‘When’ to Rate

The fourth broad area of standardization that emerged from the consensus panel discussions involved when to rate each of the seven VASES outcome measures. It was hypothesized that this level of standardization was necessary given that FEES allows for continuous visualization of successive bolus trials, often with several minutes of visualization for each bolus trial, and that the presence and severity of pharyngeal residue, penetration, and aspiration can change as a function of time [60,61,62,63]. Because of the influence of time on swallowing efficiency and safety outcome measures, establishing temporal boundaries within which to rate the PAS and residue ratings was warranted.

A literature review revealed four temporal markers within which to rate residue and PAS: before the swallow, during the swallow, after the swallow, and between bolus trials. Taking this literature into account, the following temporal definitions were determined.

  • “Before the swallow” began when new foods or liquids entered the mouth and ended at the onset of uninterrupted laryngeal elevation and pharyngeal contraction, leading to the swallowing-related endoscopic whiteout [64, 65].

  • “During the swallow” began at the onset of uninterrupted laryngeal elevation or pharyngeal contraction, continued through the period of swallowing-related endoscopic whiteout (when present), and ended when the pharynx and larynx returned to their lowest resting position [52, 63]. In the event of multiple swallows per bolus, ‘during the swallow’ began at the onset of uninterrupted laryngeal elevation or pharyngeal contraction for the first swallow and ended when the pharynx and larynx returned to their lowest resting positions after the final swallow.

  • “After the swallow” began when the pharynx and larynx first returned to their lowest resting positions [52, 63] and ended with the first of any of the following three temporal markers: (1) five seconds of inactivity or rest breathing; (2) patient vocalization (either spontaneous or cued); (3) advancement of the scope towards the laryngeal vestibule.

  • “Between bolus trials” included the period of time between “after the swallow” of one bolus trial and “before the swallow” for the subsequent bolus trial.

Residue ratings for the six anatomic landmarks were intended to capture the amount of residue present immediately after the offset of “after the swallow.” Therefore, residue was judged before residue flowed from one anatomic landmark to another and before additional cued or spontaneous swallows or coughs were elicited.

Changes in the size and shape of the oropharyngeal and hypopharyngeal spaces can alter the perception of residue severity [66]. For example, the amount of residue may appear greater when the piriforms appear smaller (e.g., when the vocal folds are abducted) compared to when the piriforms appear larger (e.g., when the vocal folds are adducted). Therefore, residue ratings for the oropharynx and hypopharynx were made when the valleculae and piriform sinuses each appeared their largest (e.g., during tidal exhalation or during a sustained phonation task).

Lastly, the single highest PAS score seen across all four temporal markers was used for ratings for each bolus. For example, if vocal fold residue (PAS 5) was initially identified after the swallow, but silent aspiration (PAS 8) was seen between bolus trials from post-swallow residue, then a PAS 8 was given. Conversely, if silent aspiration (PAS 8) was seen immediately after the swallow, leading to a delayed cough between bolus trials (PAS 7), then a PAS 8 was given.
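The "single highest PAS" rule can be expressed compactly, as in the Python sketch below. The marker names and the function name are hypothetical labels introduced for illustration; the rule itself is simply the maximum PAS score observed across the four temporal markers for a given bolus.

```python
TEMPORAL_MARKERS = ("before", "during", "after", "between")


def bolus_pas(pas_by_marker):
    """Return the highest PAS score (1-8) seen across the temporal markers."""
    scores = [pas_by_marker[m] for m in TEMPORAL_MARKERS if m in pas_by_marker]
    if not scores:
        raise ValueError("at least one temporal marker must be scored")
    if any(s < 1 or s > 8 for s in scores):
        raise ValueError("PAS scores range from 1 to 8")
    return max(scores)
```

Applied to the two examples above, a bolus scored PAS 5 after the swallow and PAS 8 between bolus trials receives a PAS 8, as does a bolus scored PAS 8 after the swallow and PAS 7 between bolus trials.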

Secondary Rules

Following consensus panel VASES ratings, several secondary rules were developed.

  1. Only judge residue that appears new and is the same color and consistency as the bolus that was presented.

  2. In instances where residue and secretions are mixed together, ignore the secretions and only estimate the amount of residue present.

  3. Raters should explicitly indicate if/when ratings were not made until after coughs or after additional clearing swallows due to incomplete endoscopic visualization, or else decide to systematically exclude such bolus trials from the reported ratings.

  4. Residue that is on, but not crossing, the border of the epiglottis and laryngeal vestibule should be interpreted only as residue that is on the epiglottis and not residue that is in the vestibule.

  5. Residue that is on, but not crossing, the border of the vocal folds and subglottis should be interpreted only as residue that is on the vocal folds and not residue that is in the subglottis.

  6. Residue ratings should ideally be made with a high degree of certainty. If the material being rated could be ‘reasonably’ interpreted as secretions, shadow, or residue from a previous swallow, then it should not be rated as new residue from the current swallow of interest (Fig. 9).

Fig. 9

Example of secretions/no vocal fold residue (left), blue residue on the vocal folds (middle), and “bright white” residue on the vocal folds (right)

Consensus Panel Ratings

Consensus panel ratings for the 55 video clips were derived from 15 full-length FEES videos. FEES examinees included ten males and five females with an average age of 69.0 years (SD 8.1). All examinees had a diagnosis of Parkinson’s disease, with an average disease duration of 10.2 years (SD 5.7) since symptom onset. Additionally, they had an average SWAL-QOL scaled score of 67.9% (SD 17.9%) [67], indicating moderately impaired swallowing-related quality of life, and a median DIGEST-FEES score of 2 (IQR 2–3; Range 0–4) [30, 38], also indicating moderate dysphagia. Twenty-five video clips were randomly selected from these FEES videos and used to assess pre- and post-training accuracy. The distribution of the criterion ratings for these video clips is shown in Fig. 10.

Fig. 10

Frequency distribution of the criterion ratings of the seven VASES outcome measures across the 25 FEES video clips. Abbreviations: visual analogue scale ratings (VAS); Penetration Aspiration Scale (PAS)

Aim 2: Feasibility of VASES Training

Pre- vs. Post-Training Accuracy in VASES

Pre-training, novice raters had a tendency to overestimate (rather than underestimate) residue in/on the anatomic landmarks when compared to the criterion ratings. Specifically, 78.1–93.6% of the ratings overestimated residue across the six anatomic landmarks, while only 6.3–21.9% of the ratings underestimated residue across the anatomic landmarks. Additionally, of the incorrect pre-training PAS ratings, 35% had higher ratings and 65% had lower ratings when compared to criterion ratings. Post-training, the types of errors made by the novices were more evenly distributed, with 30.0–53.6% of the ratings having overestimated residue across landmarks, and with 46.3–70.0% of the ratings having underestimated residue across landmarks. Additionally, of the incorrect post-training PAS ratings, 36% had higher ratings and 64% had lower ratings when compared to criterion ratings.

The average absolute error decreased by 48.1–79.6% across the six residue ratings (Fig. 11) while the proportion of incorrect PAS scores decreased by 26.8% (Fig. 12). Results revealed that these were statistically significant improvements (Table 4). Furthermore, medium-to-large effect sizes were observed for the laryngeal vestibule (d = 0.74) and subglottic residue ratings (d = 0.74), while large effects were observed for the remaining five outcome measures (d range 0.99–1.59).
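The accuracy summaries reported above (direction of error, average absolute error, and percentage decrease) could be computed along the following lines. This is a minimal sketch, not the authors' analysis code. The ratings below are invented purely for illustration, the function names are hypothetical, and the sketch assumes that over- and underestimation percentages are taken over the ratings that differed from the criterion, which is consistent with the reported ranges summing to 100%.

```python
from statistics import mean


def error_summary(ratings, criterion):
    """Summarize rating error against criterion ratings for one landmark."""
    if len(ratings) != len(criterion):
        raise ValueError("ratings and criterion must have the same length")
    errors = [r - c for r, c in zip(ratings, criterion)]
    # Assumption: direction of error is judged only on inexact ratings.
    inexact = [e for e in errors if e != 0]
    n_over = sum(1 for e in inexact if e > 0)
    n_under = len(inexact) - n_over
    return {
        "mean_abs_error": mean(abs(e) for e in errors),
        "pct_over": 100 * n_over / len(inexact) if inexact else 0.0,
        "pct_under": 100 * n_under / len(inexact) if inexact else 0.0,
    }


def percent_decrease(pre, post):
    """Relative decrease from pre to post, as a percentage of pre."""
    return 100 * (pre - post) / pre


# Invented visual analogue scale ratings for four clips (illustration only).
criterion = [10, 40, 0, 75]
pre_ratings = [30, 55, 20, 70]
post_ratings = [15, 35, 5, 75]

pre_summary = error_summary(pre_ratings, criterion)
post_summary = error_summary(post_ratings, criterion)
decrease = percent_decrease(
    pre_summary["mean_abs_error"], post_summary["mean_abs_error"]
)
```

With these invented numbers the average absolute error falls from 15 to 3.75 scale points, a 75% decrease, and the share of overestimates among inexact ratings drops from 75% to about 67%.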

Fig. 11

Average absolute error in residue ratings across anatomic landmarks for each novice rater pre- and post-training, with lower scores indicating greater accuracy

Fig. 12

Proportion of incorrect penetration-aspiration scale (PAS) scores for each novice rater pre- and post-training, with lower proportions indicating greater accuracy

Table 4 Pre- and post-training changes in VASES accuracy

Secondary Feasibility Measures of VASES Training and Implementation

Reliability

Intra-rater reliability improved from “good” to “excellent” for the laryngeal vestibule residue rating, worsened from “good” to “moderate” for the subglottis residue rating, and was unchanged (“good”) for the remaining four residue ratings and the PAS. Inter-rater reliability improved from “fair” to “moderate” for PAS, improved from “good” to “excellent” for the hypopharyngeal and laryngeal vestibule ratings, and was unchanged for the oropharynx (“good”), laryngeal vestibule (“excellent”), and subglottis (“good”) (Table 5).

Table 5 Pre- and post-training reliability

Training Completion Rate

One of the 26 enrolled novice raters dropped out of the study after beginning the VASES training protocol due to scheduling concerns. The remaining 25 novice raters successfully completed all VASES training, yielding a completion rate of 96.1%.

Time to Complete Training

The median time to complete VASES training was 6 h (IQR 4.75–10 h). The minimum time to complete VASES training was 4 h (n = 6). Only two raters reported needing more than 10 h to complete VASES training: one reported needing 12 h and the other reported needing 20 h.

Time to Rate Each Video Clip

The video clips in the Pre- and Post-Training assessments averaged 1.1 min in length (SD 0.5). The average self-reported time spent per bolus trial was 3.7 min (SD 1.0) Pre-Training and 2.6 min (SD 1.0) Post-Training. After subtracting the average length of each video clip, it took an average of 2.6 min to rate each bolus Pre-Training and 1.5 min to rate each bolus Post-Training.
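The adjustment described above is a simple subtraction of the average clip length from the self-reported time per bolus trial. The sketch below reproduces that arithmetic and compares the result with the 5 min feasibility cut-off used in this study. The function names are hypothetical, and treating the cut-off as inclusive is an assumption made only for this illustration.

```python
CUTOFF_MIN = 5.0  # feasibility cut-off for rating time per bolus (minutes)


def net_rating_time(total_min: float, clip_min: float) -> float:
    """Self-reported time per bolus trial minus the average clip length."""
    if total_min < clip_min:
        raise ValueError("total time cannot be shorter than the clip itself")
    return round(total_min - clip_min, 1)


def meets_cutoff(net_min: float, cutoff_min: float = CUTOFF_MIN) -> bool:
    """Assumption: a time equal to the cut-off still counts as feasible."""
    return net_min <= cutoff_min


AVG_CLIP_MIN = 1.1  # average video clip length reported above

pre_net = net_rating_time(3.7, AVG_CLIP_MIN)   # 3.7 - 1.1 = 2.6 min
post_net = net_rating_time(2.6, AVG_CLIP_MIN)  # 2.6 - 1.1 = 1.5 min
```

Both the Pre-Training (2.6 min) and Post-Training (1.5 min) values fall under the 5 min cut-off, so `meets_cutoff` returns `True` for each.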

Discussion

While many rating scales have been developed for and adapted to FEES, clearly defined anatomic and temporal boundaries within which to rate pharyngeal residue, penetration, and aspiration have been lacking. This ambiguity in rating methodology can potentially limit generalizability of research findings and can negatively impact exam reliability. VASES was developed to address these gaps and improve standardization of FEES ratings. The VASES rating methodology was developed through open panel discussions and a comprehensive literature review. Because VASES describes what, where, when, and how to rate pharyngeal residue, penetration, and aspiration during FEES, clinicians and researchers can begin to measure swallowing efficiency and safety in a more standardized manner.

For a standardized rating method to be widely adopted into clinical and research practices, it must be feasible to train and implement. The primary method for determining if VASES was feasible to train was to examine pre- to post-training differences in the accuracy of VASES ratings. The results from this study demonstrated significant post-training improvements in rating accuracy across all seven VASES outcome measures. One limitation, however, is the lack of a “no training” control group. Without a control group, it remains unclear how much of the improvement in rating accuracy was due to training as opposed to general practice of FEES interpretation. Nevertheless, given that intra-rater reliability was relatively unchanged while accuracy and inter-rater reliability improved, current findings suggest that this group of novice raters used a more similar and standardized rating approach post-training compared to pre-training.

From a training feasibility standpoint, we also examined training completion rate, time to complete training, and post-training changes in the accuracy and reliability of ratings. The training completion rate in this group of novice raters was relatively high (96.1%), and the median time to complete the training was relatively short (6 h), both meeting the cut-off criteria for training feasibility. While 6 h for training may be considered time-consuming by some, it is substantially shorter than the 25 h reported by other popular swallowing training programs [55].

Inter- and intra-rater reliability of all of the VASES outcome measures also met the cut-off for training feasibility, with the exception of PAS, which was observed to be “fair” (inter-rater) and “moderate” (intra-rater) post-training. This may have been due in part to the limited number of aspiration events (20%) included in the randomly selected videos for intra-rater reliability assessment. Alternatively, this could be an indicator that additional training for PAS may be warranted. It is worth noting that while PAS reliability was “fair” to “moderate”, these values are similar to (or greater than) PAS reliability results often reported in the dysphagia literature using similar statistical methods [37, 52,53,54, 68, 69].

A second important consideration for developing a standardized rating method for FEES is that the rating method must be feasible to implement into clinical practice. This important question is a direction of future research which relies, in part, on identifying facilitators and barriers for implementing standardized rating methods for FEES into clinical practice. While the primary aim of the present study was not designed to examine the feasibility of VASES implementation, the self-reported time to rate a video clip was used to begin to address this clinically relevant question. Reviewing audio–video FEES recordings is an expected and critical component of the evaluation process, is required for FEES billing purposes within the USA [70], and can significantly influence the reliability and accuracy of FEES interpretation [71]. Findings from the current study revealed that, after training, novice raters took an average of 1.5 min to rate each bolus trial. This met our cut-off threshold of 5 min, suggesting that implementing VASES into clinical practice may be feasible. Furthermore, previous research demonstrates that the time to rate each bolus should decrease over time with continued practice [72]. Therefore, depending on the number of bolus trials and the amount of time logistically allowed for review of the FEES, swallow-by-swallow analysis of FEES using VASES appears largely feasible to implement. Feasibility of implementation may be further increased if analyzing only a select few swallows—an approach often used in clinical practice with other swallowing evaluation protocols (e.g., MBSImP; Dynamic Swallow Study) [64, 73].

It is interesting to note that good to excellent intra- and inter-rater reliability was observed within this cohort of novice raters prior to VASES training. This high level of intra- and inter-rater reliability may be attributed in part to the use of visual analogue scales. This is consistent with the research by Pisegna et al. who identified good-to-excellent reliability when rating pharyngeal residue with visual analogue scales, regardless of experience level [39]. The provision of a standardized anatomic landmark referent image, despite not being provided with the methodology to form standardized anatomic boundaries, may have also contributed to the high level of baseline intra- and inter-rater reliability. Furthermore, reliability appeared relatively unchanged (intra-rater) or improved (inter-rater) after VASES training. This demonstrates that adding the VASES rules and operational definitions does not negatively impact the high level of reliability seen at baseline with use of visual analogue scales and an anatomic boundaries referent image.

Several limitations and areas of future research should be considered when interpreting the findings of this study. First and foremost, the present study did not examine the content validity of VASES. Instead, it was intended to describe the development of VASES and to explore its feasibility of use. Given that VASES has now been developed and training has been shown to be feasible, next steps should include examining content validity by comparing VASES to other validated FEES rating scales, including the YPRSRS, BRACS, or DIGEST-FEES. Validation work should also explore the use of VASES in patient populations beyond neurodegenerative disease and the sensitivity of VASES for tracking changes in swallowing function over time. Second, as part of the VASES training, novice raters engaged in a group Q&A session. By nature, this Q&A session involved some variability in training instruction, potentially limiting the generalizability of training effects. It remains unknown how much accuracy would have improved had raters been given only the written rules (i.e., those outlined in this document) or a standardized set of training documents (see supplemental materials), without didactic training. Therefore, future research should compare training methods (e.g., with vs without Q&A sessions). Third, the current study only evaluated the effects of VASES training on novice raters. It is possible that training effects may not generalize to raters with greater FEES experience. Therefore, future research should examine whether experienced raters improve the accuracy of their VASES ratings following training, and should compare training effects between raters with differing levels of experience. Lastly, VASES is intended to be a standardized framework with which to rate and interpret FEES. It may evolve over time and may be adapted as needed to meet the needs of specific patient populations. Scales and rating methods that incorporate anatomically defined rules into their protocols (e.g., BRACS, Yale, MBSImP, PAS, VASES) may need to be altered in situations where patients present with altered anatomy (e.g., secondary to cancer or post-operative surgical changes). Adjusting the VASES methodology in these situations is expected. However, if the VASES rating method is adjusted (e.g., by rating left and right oropharynx and hypopharynx separately or by omitting certain outcomes), then these alterations should be specifically reported to maintain transparency, reliability, and standardization.

Conclusions

The Visual Analysis of Swallowing Efficiency and Safety (VASES) is a standardized rating methodology used to enhance the evaluation of pharyngeal residue, penetration, and aspiration on FEES. VASES facilitates good to excellent intra- and inter-rater reliability of FEES analysis in novice raters, can be feasibly taught with high levels of success to novice raters, and may be an effective method to analyze residue, penetration, and aspiration in clinical and research practices. Future research is needed to determine the validity of this method by examining the relationship between VASES ratings and other validated FEES rating scales.