Introduction

Alterations in laryngo-pharyngeal mechano-sensitivity (LPMS) induced either by hypo- or hypersensitive states are common disorders of the upper airway. Such alterations have been reported in dysphagia [1], obstructive sleep apnoea (OSA) [2], and cough due to laryngeal hypersensitivity [3, 4].

Dysphagia affects approximately 8.4% of the general population [5] and is a risk factor for aspiration pneumonia [6], which carries a high burden of morbidity and mortality [7, 8]. OSA is a major cardiovascular risk factor, with a prevalence above 10% [9] and a high impact on mortality [10]. Cough due to laryngeal hypersensitivity, in turn, may have a prevalence of 6% [11] and causes a significant impairment in health-related quality of life [12].

Moreover, there are experimental interventions that appear to improve hypo- and hypersensitivity problems in the upper airway [3, 13–15], but objective and reproducible methods of assessing the true efficacies of these interventions are not currently available.

Aviv developed a trans-laryngeal air-pulse stimulator [16, 17]. Unfortunately, the good intra-rater reproducibility of his test has not been replicated among less expert raters (i.e. junior otolaryngology residents [18]), and the inter-rater reproducibility is poor [18]. Hammer developed an air-pulse stimulator with improved air-pulse pressure and duration reliabilities [19]. However, reliability studies of Hammer’s device have been scarce.

Recently, a new laryngo-pharyngeal esthesiometer (LPEER) was developed, which includes an air pulse generator and an endoscopic laser rangefinder, with the goal of resolving the reliability problems of previous devices [20]. The study protocol also included measurements of laryngeal-adductor reflex threshold (LART), cough reflex threshold (CRT), and gag reflex threshold (GRT) [20, 21]. Preliminary tests revealed promising results, but we do not know whether the technological aids of the LPEER are sufficient to improve the reliability of the LPMS evaluations.

To validate the LPMS evaluations using the LPEER, we compared the results of expert and novel raters in a prospective cohort of patients. We determined the median values of the laryngo-pharyngeal reflex thresholds of aspiration patients and healthy patients and defined cut-off points at which measurement errors should be suspected when performing the test.

Materials and methods

Study population

We prospectively and consecutively recruited a cohort of patients to assess the reliability of the LPEER from two tertiary care university hospitals. The inclusion criteria were a patient age of 18 years or older and stroke with oropharyngeal dysphagia; volunteers without dysphagia served as controls. The exclusion criteria were respiratory failure, bleeding diathesis and anticoagulant therapy. The criteria for removing a recruited patient from the study were any degree of epistaxis or severe discomfort.

The institutional review board of each recruitment centre approved the protocol, all participants provided written informed consent, and the study adhered to good clinical practices. Patient enrolment was performed from December 30, 2013 through September 19, 2014.

Tests

The patients underwent a standard clinical evaluation by a Speech Language Pathologist (SLP) with 7 years of experience in dysphagia [21]. The clinical evaluation included a validated Spanish version of the EAT-10 [22], the determination of the Rankin Scale score [23] for stroke patients and administration of the Glasgow Coma Scale for patients with any abnormal level of consciousness [24].

The LPMS evaluation consisted of measurements of the LART, CRT and GRT by an endoscopist (rater) [21, 25] using the LPEER, which was connected to a conventional fibre bronchoscope (Pentax FB-10V, Pentax of America, Montvale, NJ, USA) with a working channel of 1.2 mm internal diameter. The bronchoscope was also connected to a video system (Pentax PSV-4000, Pentax of America, Montvale, NJ, USA), a light source (Pentax LH-150PC, Pentax of America, Montvale, NJ, USA) and a computer (Samsung RV420 Core-i5, Samsung Electronics, Suwon, South Korea) for image processing and recording of the entire test [20]. The bronchoscope was lubricated with a water-soluble gel (without any anaesthetics), introduced through the nasal cavity of the patient placed in a seated or semi-recumbent position and advanced into the pharynx. Each rater determined the sensory thresholds at predetermined points of the laryngo-pharyngeal tract: the LART and CRT were measured at the aryepiglottic fold at a point between the corniculate and cuneiform cartilages, while the GRT was explored at the lateral wall of the pharynx at a point lateral to the epiglottis. In preliminary observations [20], it was noted that these particular sites elicit more consistent reflexes. The details of the reflex threshold determinations are published elsewhere [21].

The LART was measured via a series of air pulses of 100 ms duration that decreased in intensity from 0.7 to 0.04 mN [16, 17, 19, 21]. The CRT and GRT were explored via a series of air pulses of 1000 ms duration that increased in intensity from 0.8 to 16.5 mN. Each reflex threshold was defined as the minimum air-pulse intensity that elicited the corresponding reflex. When the repeated reflex threshold measurements were different, the true threshold was the lowest air-pulse stimulus that elicited such a reflex.
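The threshold rule described above can be sketched in a few lines of Python. This is only an illustration: the function names, the data layout and the intermediate forces in the example series are assumptions made for the sketch, not part of the LPEER software or of the study protocol.

```python
from typing import Optional, Sequence, Tuple

# One trial: (air-pulse force in mN, whether the reflex was observed).
Trial = Tuple[float, bool]


def series_threshold(trials: Sequence[Trial]) -> Optional[float]:
    """Weakest force that elicited the reflex in one series, or None."""
    forces = [force for force, elicited in trials if elicited]
    return min(forces) if forces else None


def reflex_threshold(series: Sequence[Sequence[Trial]]) -> Optional[float]:
    """Combine repeated series: keep the lowest per-series threshold."""
    values = [t for t in (series_threshold(s) for s in series) if t is not None]
    return min(values) if values else None


# Hypothetical descending LART series (forces chosen within 0.7 to 0.04 mN).
first = [(0.7, True), (0.4, True), (0.2, True), (0.1, False), (0.04, False)]
second = [(0.7, True), (0.4, True), (0.2, False), (0.1, False), (0.04, False)]

lart = reflex_threshold([first, second])  # lowest eliciting force across both series
```

Returning `None` when no pulse elicits the reflex keeps an absent response distinct from a numeric threshold.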

We measured air-pulse intensity in mN rather than mmHg due to the geometric characteristics of air pulses, and to compare LPMS with the esthesiometry of other organs [20, 26, 27].

For the LART measurement, the air pulses were set to decrease in intensity because in a preliminary group of subjects [20], it was found that starting with a supra-threshold stimulus helped identify the normal reflex, which otherwise might go unnoticed by a non-experienced rater. The range of stimulus intensities used for the LART measurement was well tolerated and did not induce patient discomfort, except in those with laryngeal hypersensitivity. For members of the latter group who coughed or gagged when stimulated with air pulses of 0.7 mN, we started the air pulse series for the LART measurement at 0.4 mN.

The GRT and CRT were measured by administering air pulses that were set to increase in intensity. These reflexes are very clear and easily detected by any rater; setting the air pulse series to measure them at increasing intensities allowed the rater to stop administering the air pulses once the gag or cough reflex was elicited, to decrease patient discomfort produced by repetition of these reflexes.

An expert and a novel rater determined the LART, CRT and GRT during the same endoscopic procedure but at different moments, sequentially and in random order. We used a table of random numbers to establish this order such that each rater was first and second for similar numbers of evaluations. Raters performed two measurements of each reflex threshold on the right and left side of the laryngo-pharyngeal tract in each patient in the following order: GRT, CRT, and LART. The expert rater was a pulmonologist with 9 years’ experience in the fibre-optic endoscopic evaluation of swallowing with sensory test (FEESST); this rater performed the LPMS evaluation in the 67 patients who were finally included in this study (34 with dysphagia due to stroke and 33 without dysphagia; see Table 1). These 67 patients also received the LPMS evaluation from a novel rater, who was a physician with less than 6 months of experience with the FEESST. Several physicians served as novel raters: a pulmonologist or pulmonary fellow for seven patients, a general practitioner for six patients and a medical intern undertaking a 6-month rotation in pulmonary medicine for 54 patients. The novel raters received 1 month of training that included theoretical and practical sessions (using manikins and humans), upper airway endoscopy and FEESST training. During measurement, the air pulses were identified by a number instead of the air-pulse force (for blinding purposes), and the numbers were replaced by the corresponding air-pulse forces only at the end of the study.

Table 1 General characteristics of the cohort

After the sensory evaluation, the expert rater, with the assistance of an SLP, performed a fibre-optic endoscopic evaluation of swallowing (FEES) in all patients, which was considered the reference standard for oropharyngeal dysphagia (the study population was divided according to the FEES findings into patients with stroke and dysphagia and patients without dysphagia). During FEES, patients were tested with four food consistencies (i.e. purée, thick liquid, solid, and thin liquid) according to the standard FEES protocol [21, 25]. All food was coloured green for contrast.

The safety of swallowing was evaluated during the FEES by monitoring for penetration (i.e. the entrance of material into the laryngeal vestibule), residues (i.e. the presence of material in the pharynx after swallowing), aspiration (i.e. the entrance of material below the vocal cords) and premature spillage (i.e. the premature passage of food from the oral to the pharyngeal cavity). The severity of alteration in swallowing was rated based on the consensus of the expert rater and the SLP according to an 8-point penetration-aspiration scale [28, 29] and the dysphagia severity scale (DSS) [21, 30, 31].

Patients rated the pain, nausea, headache and discomfort experienced during the FEESST on a scale from 0 to 10, with 0 corresponding to the absence of symptoms and 10 corresponding to the maximum intensity of the symptom that the patient had ever experienced.

Statistical analysis

We performed Kolmogorov–Smirnov tests to determine whether the quantitative variables were normally distributed. We compared the normally distributed variables with t-tests and compared the non-normally distributed variables with Mann–Whitney U tests. For all of the statistical analyses, differences were considered as statistically significant at P < 0.05 (two-tailed).

To assess the LART, CRT and GRT reliabilities, we calculated the intra- and inter-rater intraclass correlation coefficients (ICCs) and Spearman correlation coefficients (SCCs) with their 95% confidence intervals (95% CIs). We evaluated intra- and inter-rater agreement using Bland–Altman plots of the limits of agreement with their 95% CIs. The inter-rater comparisons were performed by contrasting the measurements of the expert and novel raters.
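As an illustration of the agreement analysis, the following Python sketch computes the bias and the 95% limits of agreement that a Bland–Altman plot displays (mean of the paired differences ± 1.96 standard deviations). The function name and the example readings are hypothetical, the confidence intervals of the limits are omitted, and the sketch does not reproduce the MedCalc or SPSS procedures used in the study.

```python
from statistics import mean, stdev


def bland_altman(a, b, z=1.96):
    """Return (bias, lower limit, upper limit) for paired measurements a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long series with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)              # mean paired difference
    spread = z * stdev(diffs)       # z times the sample SD of the differences
    return bias, bias - spread, bias + spread


# Hypothetical CRT readings (mN) from an expert and a novel rater.
expert = [2.6, 4.4, 4.4, 7.9, 11.9]
novel = [2.6, 4.4, 7.9, 7.9, 9.9]

bias, lower, upper = bland_altman(expert, novel)
```

A bias close to zero with narrow limits indicates that the two raters can be used interchangeably for that reflex.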

For the sample size calculation, we used the equation proposed by Bonett for the ICC [32]. Based on this equation, for an ICC ≥0.7 with a 95% CI width ≤0.25, we required 66 patients.

The statistical analyses were performed using IBM SPSS statistical software, version 20 (Armonk, NY, USA), MedCalc version 14.12.0 (MedCalc Software bvba, Ostend, Belgium), and Microsoft Excel 2007 (Microsoft Corporation, Redmond, WA, USA).

Results

We evaluated 124 patients. Thirty-seven did not meet the inclusion criteria because they had dysphagia secondary to conditions other than stroke or were younger than 18 years. Two patients met the exclusion criteria: one due to anticoagulant therapy, and one due to respiratory insufficiency secondary to amyotrophic lateral sclerosis. Twelve patients declined consent to participate, and six were excluded during testing (four because thresholds could not be obtained owing to continuous movements of the laryngo-pharyngeal tract, one due to severe discomfort and one due to epistaxis). Thus, 67 patients were ultimately included in the analysis (Fig. 1).
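The enrolment figures can be cross-checked with a short tally. The dictionary keys below are descriptive labels chosen for this sketch; the counts are those stated in the paragraph above.

```python
screened = 124

# Patients not analysed, grouped by the reason given in the text.
not_analysed = {
    "inclusion criteria not met": 37,
    "exclusion criteria met": 2,
    "declined consent": 12,
    "removed during testing": 6,
}

# Breakdown of the six patients removed during testing.
removed_during_testing = {
    "unobtainable thresholds": 4,
    "severe discomfort": 1,
    "epistaxis": 1,
}

analysed = screened - sum(not_analysed.values())
```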

Fig. 1 Enrolment flowchart. Reasons for removal from the study: discomfort: 1, epistaxis: 1, continuous laryngo-pharyngeal movements impeding threshold exploration: 4

The patients with stroke and dysphagia were older, comprised a greater proportion of men and had more comorbidities (Table 1). The stroke patients had moderate strokes according to the National Institutes of Health Stroke Scale (NIHSS). The overall severity of dysphagia in the stroke patients according to the Penetration-Aspiration scale was moderate (ranging from mild to severe). Half of the stroke patients exhibited penetration, and one-third exhibited aspiration (Table 1).

Each rater performed four measurements per reflex in each patient (two measurements on each side of the laryngo-pharyngeal tract). We performed a total of 24 measurements per patient, yielding 1608 measurements over all 67 patients in the study.
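The measurement counts follow directly from the study design, as this short sketch shows; the variable names are illustrative.

```python
raters = 2                          # one expert and one novel rater per patient
reflexes = ("LART", "CRT", "GRT")
sides = 2                           # right and left
repeats = 2                         # two measurements per side
patients = 67

per_rater_per_reflex = sides * repeats                        # 4
per_patient = raters * len(reflexes) * per_rater_per_reflex   # 24
total = per_patient * patients                                # 1608
```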

The intra-rater ICCs for all of the reflex threshold determinations were above 0.90, and the inter-rater ICCs were 0.87 for the LART, 0.79 for the CRT and 0.70 for the GRT (Table 2).

Table 2 Intra- and inter-rater intraclass correlation coefficients

The intra-rater SCCs for all of the reflex thresholds were above 0.88, and the inter-rater SCCs were 0.80 for the LART and CRT and 0.70 for the GRT (all P < 0.0001) (Table 3).

Table 3 Intra- and inter-rater Spearman correlation coefficients

The Bland–Altman plots of the limits of agreement revealed mean intra- and inter-rater differences that were close to zero and free of increments in the variabilities at the extremes of the averages, with the exception of a mild trend toward lower values for the GRT in the expert rater measurements compared with those of the novel rater at the lower GRTs (Fig. 2). Ninety-five percent of the inter-rater differences in the LART and CRT were less than the differences in the thresholds between the aspirators and non-aspirators (Fig. 2; Table 4). Regarding the GRT, the intra-rater limits of agreement were lower than the differences in the thresholds between the aspirators and non-aspirators, but the inter-rater limits of agreement were greater (Fig. 2; Table 4).

Fig. 2 a Intra-rater LART limits of agreement plot; b inter-rater LART limits of agreement plot; c intra-rater CRT limits of agreement plot; d inter-rater CRT limits of agreement plot; e intra-rater GRT limits of agreement plot; f inter-rater GRT limits of agreement plot. LART laryngeal-adductor reflex threshold, CRT cough reflex threshold, GRT gag reflex threshold, mN millinewtons, SD standard deviation, Exp expert, Nov novel

Table 4 Reflex thresholds according to aspiration status

The patients with aspiration showed higher reflex thresholds; this difference was clinically and statistically significant (Table 4).

The median normal value was 0.14 mN for the LART (IQR 0.11–0.24), 4.44 mN for the CRT (IQR 2.63–7.93) and 11.88 mN for the GRT (IQR 4.44–16.44).

We calculated the 95% CIs of the intra-rater differences in the limits of agreement to establish the upper limits as cut-offs for the identification of outliers; the resulting cut-offs were 0.12 mN for the LART, 3 mN for the CRT and 4 mN for the GRT.

The patients did not report pain during the exam, and the majority rated their discomfort as mild to moderate (median discomfort: 4/10; IQR 3–6). We observed no cases of syncope, pre-syncope or laryngospasm and no requirements for emergency room care or hospitalization due to adverse events.

Discussion

In the present study of LPEER reliability, we observed ICCs that were within the range of excellent reliability for all of the reflex threshold measurements with the exception of the inter-rater GRT; however, this latter measure also achieved substantial agreement [33, 34]. The SCCs for all of the reflex threshold measurements were within the range of strong reliability [35]. The Bland–Altman limits of agreement plots revealed that the intra- and inter-rater differences were close to zero and free of trends related to the averages and of signs of bias, with the exception of a mild trend toward lower GRT values in the expert rater measurements at lower GRT values. The widths of the limits of agreement of the LART and CRT were adequate for comparing patients with and without dysphagia.

Our intra-rater ICC and SCC results from the expert rater are similar to those reported by Aviv [17] and better than those of Cunningham [18]; however, we also observed excellent reliability in the results from the novel raters. Furthermore, our results revealed excellent inter-rater reliabilities in the comparisons of the expert and novel raters, which have not been reported with previous devices. We chose two raters at opposite ends of the expertise spectrum to include the full range of variability existing in real clinical practice: the difference in measurements between raters with more closely aligned degrees of expertise would be lower than what we found, yielding even better agreement with the Bland–Altman limits of agreement method. Obtaining worse results than ours with this method would require a greater difference in rater expertise than in our study, which is unlikely to be found in real clinical practice. These results support the efficacy of the technological aids that have been introduced to the LPEER [20] in improving the reliability of LPMS evaluation.

The extent to which other clinical factors, e.g. the viscosities of secretions and continuous movements of the laryngo-pharyngeal tract, would affect the reliability of LPMS evaluation was unclear. However, the excellent reliabilities obtained for the tests that were performed on patients with a wide range of deglutition alterations (Penetration-Aspiration score ranging from 1 to 7) [28] suggest that these other factors were either not clinically relevant or were indirectly controlled for by the LPEER technological improvements [20]. Indeed, the LPEER enables control of the distance and angle of the stimulus delivery over the target surface, similarly to esthesiometers that have been designed for other organs [36].

To the best of our knowledge, this study provides one of the most comprehensive evaluations of the LPMS in terms of reliability, including assessments of the ICCs, SCCs and limits of agreement. We validated a device and method for exploring the LART, CRT and GRT. Our results support the utilization of the LART and CRT in normal patients and patients with dysphagia secondary to stroke. However, the limits of agreement for the inter-rater GRT were greater than the differences between the patients with and without dysphagia. These findings are consistent with those of previous studies reporting a lower reliability of this reflex than those of other reflexes in dysphagic patients [37, 38]. Although the GRT exhibited less utility for the dysphagic patients, our study does not allow us to rule out its utility for diseases involving lower GRTs, such as motor neuron diseases [39, 40] and other laryngo-pharyngeal hypersensitivity states.

All reflexes exhibited significantly higher thresholds in the dysphagic stroke patients, which highlights the sensory compromises that are typical of this condition [37, 41]. Similar findings have previously been reported for the LART [37, 41]. While there have been prior reports of cough reflex compromise in stroke patients assessed with chemical stimuli [42, 43], none have used standard mechanical stimuli like those provided by the LPEER for such assessments. These sensory compromises of the airway-protecting reflexes most likely contribute to the development of dysphagia in stroke patients and to the increased risk of pneumonia in said individuals [44, 45]. Therefore, they could be used to identify patients at high risk of pneumonia who would benefit from interventions to reduce pneumonia incidence, including vaccination, oral care, diet modification and the better selection of patients for gastrostomy. In addition, there are experimental interventions that are undergoing evaluations in clinical trials, with the goal of improving the laryngo-pharyngeal sensory compromise observed in stroke patients to improve their deglutition alterations [13, 46, 47], and these trials could benefit from our quantitative method of LPMS measurement for the evaluation of these interventions’ efficacy.

Our normal LART value was 0.14 mN, which corresponds to 2.5 mmHg in Aviv’s system of measurement [20, 21]. This normal value is consistent with Aviv’s work [16, 48] and is also comparable to the results of the study by Grushka [27], in which a tactile sensory threshold of 14.9 mg (equivalent to 0.15 mN) was observed at the most sensitive point of the tongue.
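The equivalence between a threshold expressed as a mass and one expressed as a force can be checked by multiplying the mass by the acceleration of gravity. The sketch below assumes standard gravity (9.80665 m/s²); the function name is illustrative.

```python
STANDARD_GRAVITY = 9.80665  # m/s^2, conventional standard value (assumed here)


def mg_to_mn(mass_mg: float) -> float:
    """Convert a mass in milligrams to the weight it exerts, in millinewtons."""
    mass_kg = mass_mg * 1e-6
    force_n = mass_kg * STANDARD_GRAVITY
    return force_n * 1e3


tongue_threshold_mn = mg_to_mn(14.9)  # about 0.146 mN, i.e. 0.15 mN to two decimals
```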

To rule out any measurement errors in clinical explorations of the LPMS, each patient should undergo at least two measurements per reflex per side of the laryngo-pharyngeal tract. Whenever the difference between measurements exceeds the outlier limits that we have reported (i.e. 0.12 mN for the LART, 3 mN for the CRT and 4 mN for the GRT), the rater should suspect measurement error and perform additional measurements.
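A minimal sketch of this check is given below, using the intra-rater cut-offs reported in the Results (0.12 mN for the LART, 3 mN for the CRT and 4 mN for the GRT). The function and constant names are hypothetical, and the example readings are invented for illustration.

```python
# Outlier cut-offs in mN, taken from the Results section.
CUTOFFS_MN = {"LART": 0.12, "CRT": 3.0, "GRT": 4.0}


def needs_repeat(reflex: str, first: float, second: float) -> bool:
    """True when two same-side measurements differ by more than the cut-off."""
    return abs(first - second) > CUTOFFS_MN[reflex]


# Hypothetical paired readings (mN) for one side of the laryngo-pharyngeal tract.
flag_lart = needs_repeat("LART", 0.14, 0.40)   # difference 0.26 mN
flag_crt = needs_repeat("CRT", 4.4, 4.4)       # identical readings
flag_grt = needs_repeat("GRT", 11.9, 16.4)     # difference 4.5 mN
```

A flagged pair would prompt an additional measurement, not automatic rejection of either reading.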

We did not observe clinically relevant adverse events, and our patients reported only mild to moderate discomfort, which is consistent with previous FEESST safety reports [49].

Conclusion

The explorations of the LARTs and CRTs with the LPEER showed excellent intra- and inter-rater reliability. The GRT exploration exhibited substantial reliability, but the larger width of the inter-rater limits of agreement of this test could limit its application in stroke patients.