Introduction

Operatic solo singing requires vocal sound production without electronic amplification, at sound levels sufficient to compete with large orchestras and choirs. This kind of voice production for artistic purposes extends the fundamental frequency (fo) range well beyond what is used in human speech communication, where the average fo is about 120 Hz and 200 Hz for adult males and females, respectively1. The entire singing fo range can only be covered if the different laryngeal production mechanisms available to the human voice (often termed voice “registers”2,3) are utilized.

The two main laryngeal mechanisms are mechanism M1 (also frequently termed the “chest” or “modal” register, and typically used in speech and often in singing) and M2 (also frequently termed “falsetto” or “head” register, mainly used in singing, but sometimes also in speech). The most extreme upper musical pitch range of operatic sopranos (typically sung by adult human females, who are generally known to phonate at higher frequencies than males due to their shorter vocal fold length) extends the voice to an fo range of about 1000–1600 Hz, or about three octaves above the fo of speech. This range, which is regularly accessible to professionally trained classical soprano singers, is acoustically characterized by a strong fundamental and weak overtones in comparison to other laryngeal mechanisms4,5, justifying a classification into a separate mechanism, M3. This M3 voice production mechanism is commonly called the “whistle register” (German: Pfeifstimme, cf. ref. 6) in singing voice pedagogy7. There is no clear definition of the actual range and mechanism of that register: Garnier et al.8 found a transition to the whistle register already between the pitches D#5 and D6, while some authors only speak about a transition at ca. C6, and Titze would only call it whistle register from F6 upward (ref. 9). Mechanism M3 is hypothesized to be distinguishable from M2 due to differences (1) in the voice source, i.e., the laryngeal mechanism, (2) in the resonances or tuning strategy, and (3) in the interactions of the voice source and the resonances. Detailed essays on this can be found elsewhere2,3,10,11. The term “whistle” would suggest an aeroacoustic production mechanism, where a rigid structure combined with a certain resonance causes airflow instabilities that produce such high frequencies.

In contrast, humans typically produce voice through the MyoElastic-AeroDynamic (MEAD) principle12,13,14. There, the vocal folds enter a state of self-sustained oscillation. The ensuing medio-lateral oscillation of the vocal folds, successively facilitating partial or full closure of the laryngeal airway, causes cyclic modulation of the exhalatory airflow. Those airflow fluctuations translate to pressure variations which are the main constituent of the generated vocal sound15,16 (see Fig. 1C,D). In particular, the main acoustic excitation event occurs at the moment of airflow cessation during each cycle17, when the vocal folds are maximally approximated in their membranous part, often resulting in full vocal fold collision. The corresponding medio-lateral vibratory mode, resulting in airflow modulation, is a crucial requirement for the MEAD mechanism and thus distinguishes it from an aeroacoustic production principle.

Figure 1

Schematic overview of Aero-Acoustic (AA) and MyoElastic-AeroDynamic (MEAD) voice production mechanisms. (A) Mid-sagittal view of rodent larynx, illustrating the formation of the impinging jet in AA sound production; (B) schematic display of the lack of tissue vibration in the AA mechanism and the resulting sinusoidal sound source; (C) coronal view of human larynx, illustrating the open and the closed phase of medio-lateral vocal fold vibration; (D) vocal fold displacement pattern and resulting prototypical acoustic voice source22 for the MEAD mechanism.

MEAD is the predominant mechanism of sound production in mammals, extending across a range of body sizes and fo, spanning more than four orders of magnitude from 10 Hz to 120 kHz. However, the murine rodents (which with 1400 species comprise about 25% of all 5400 mammal species) have adopted a completely different physical mechanism of sound production to extend their high-frequency vocalization range. Murine rodents, including mice and rats, produce ultrasonic vocalizations (USV) with an aeroacoustic mechanism18. Several aeroacoustic mechanisms, differing in their local flow conditions and acoustic feedback properties, have been proposed to explain USVs in rodents: (a) wall-impinging jets19,20 (i.e., focused airflow that strikes an opposing surface; see Fig. 1A,B); (b) edge-impinging jets (resulting from successive oscillation of an airflow jet to alternate sides of a ridge that is struck by the airflow)21; and (c) cavity whistles (generated through air vortex oscillation within the cavity)20. In laboratory rats and mice, the wall-impinging whistle drives USV production, but alternative mechanisms may be found in the large number of rodent species19,20. Interestingly, in these species, the respective USV production mechanism co-exists with the “conventional” MEAD mechanism, the latter being exclusively used for vocalization at lower and thus humanly audible fo.

Given the highly conserved laryngeal anatomy across mammals23, it might be possible that operatic sopranos’ vocalizations at very high fo are also produced by a special aeroacoustic mechanism. Indeed, a number of authors describe the general possibility of an alternative aeroacoustic sound production mechanism in humans. This was hypothesized to be achieved by means of “chink tones” analogous to whistling24 in the larynx and subsequent cavity resonance12, or vortex-induced vibration of the folds25,26,27,28, possibly involving interactions between the voice source and the vocal tract29,30. Such mechanisms are sometimes referred to as M4 or “glottal whistle” for the fo range of 1–3 kHz and above31. In such an aeroacoustic sound production mechanism it could be expected that the frequency of the whistle fm1 becomes fo. Other studies showed that high-pitched human vocalization can be produced with the MEAD mechanism5,6,8,28,32,33,34,35. However, all those studies come with certain limitations: They either (a) had a limited number of participants for laryngoscopic examination, i.e., n = 1 (refs. 8, 32, 33, 34), n = 2 (ref. 35), or an unreported number (ref. 28); (b) employed a limited data acquisition methodology which was either indirect, using electroglottography5, or did not allow direct observation of the vocal folds along their entire antero-posterior length32; and/or (c) had a limited temporal resolution (with the Nyquist frequency below fo), thus resulting in aliasing and preventing adequate time-resolved within-cycle documentation of the sound production mechanism6,8,34. Even though there is evidence for both principles, it remains uncertain which sound production mechanism and configuration are responsible for operatic singing in a range of fo that resides at the bottom of the so-called whistle register range, which typically exceeds fundamental frequencies of 1000 Hz.
In addition, confusion regarding the terminology, the associated mechanisms, and their application in different music genres remains widespread to this day. In this study, we address the issue by providing the first comprehensive documentation of high-pitched soprano singing with super-high-speed video (HSV) laryngoscopy at 20,000 frames per second (fps), investigating a larger cohort of professional operatic sopranos.

Results

First, we tested whether the wall-impinging jet model developed for rodents20 also applies to human voice production anatomy and physiology. Notably, our data (Fig. 2) suggest that such an aeroacoustic mechanism is hypothetically possible for high-pitched female operatic singing. The three independent parameters of that model—i. e., glottal area, impingement length (the length of the airflow jet that strikes the opposing surface), and volumetric airflow—could be gradually controlled by singers through glottal adduction, epiglottis tilt (i.e., a backwards rocking motion of the epiglottis, reducing the volume of the space just above the vocal folds), laryngeal constriction, and muscular adaptations of the pulmonary apparatus, thus theoretically allowing for gradual control of the emerging phonatory frequency in artistic contexts.

Figure 2

Simulation of hypothetical frequencies for impinging jet sound production in humans, using the model presented in ref. 20. The resulting frequencies scale linearly as a function of volumetric airflow V. The figure illustrates the case of V = 150 ml/s, i.e., a default flow value seen in human singing36. (A) and (B) isoparametric curves for emerging mode-1 frequencies (i.e., lowest possible stable frequency of a whistle) as a function of glottal area (AGL) and impingement length (x). The gray area in (A) depicts the fundamental frequency region of the “whistle” register in female operatic singing37. (C) Strouhal number St (a dimensionless quantity describing the oscillatory flow mechanism). Stable vortex whistles are expected at d/x < St < 1, where d is the jet diameter.

Contrary to these results from the simple aeroacoustic model, empirical data from the nine investigated professional singers strongly suggest that the aeroacoustic mechanism is not the origin of high-pitched soprano singing. As compelling supporting evidence for the MEAD mechanism, we found vocal fold vibration and collision in all nine participants (see supplementary materials for HSV samples from all participants). In all investigated sopranos, the vibratory frequency of the medio-lateral vocal fold oscillation corresponded to the fo of the radiated sound (see Suppl. Fig. S1). This suggests that the tissue oscillation is causal to sound generation, which is highly indicative of the MEAD principle. An example of this phenomenon is documented in Fig. 3: The HSV still images in Fig. 3D document full glottal closure along the entire visible vocal fold length. The electroglottographic (EGG) signal (panel C) shows clear cyclic variation of vocal fold contact along the sagittal glottal plane. The glottal area, i.e., the opening between the left and right vocal folds, co-varied in synchrony with the EGG signal. The resulting glottal area waveform (panel B) reached a maximum when the relative vocal fold contact area, as retrieved by the EGG signal, was at a minimum. This is in good agreement with prototypical human voice production in the modal register (e.g., in speech), where the vocal fold contact assumes a maximum when the glottis is closed, and vice versa38. Without exception, the acoustic signals captured from the singers resembled a harmonic series with a number of noteworthy harmonics (between two and seven) throughout the examined vocal range, from pitch C6 (fo ≈ 1047 Hz) to G6 (fo ≈ 1568 Hz) in all investigated sopranos, with participant S3 achieving phonation at musical pitch B6 (fo ≈ 1975 Hz; see Suppl. Fig. S2, outside the regular experimental protocol).
The sound level differences (H1-H2) between the first and second harmonic of the radiated acoustic signal were 23.47 (± 8.21) dB for the lowest fo and 17.20 (± 4.48) dB for the highest investigated fo of each participant (see Suppl. Fig. S3,S4 for details).
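As a rough illustration of the H1-H2 metric, the following sketch estimates the level difference between the first and second harmonic of a sampled signal by evaluating a single-bin discrete Fourier transform at fo and 2·fo. It is not the analysis pipeline used in the study: the function names, the synthetic test signal, and the choice of fo = 1600 Hz (picked so that an integer number of cycles fits into the analysis window) are assumptions made for illustration only.

```python
import math

def harmonic_level_db(signal, sample_rate_hz, freq_hz):
    """Level in dB (arbitrary reference) of the component at freq_hz,
    estimated with a single-bin discrete Fourier transform."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / sample_rate_hz)
             for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
             for i, s in enumerate(signal))
    amplitude = 2 * math.hypot(re, im) / n
    return 20 * math.log10(amplitude)

def h1_h2_db(signal, sample_rate_hz, fo_hz):
    """Level difference between the first and the second harmonic."""
    return (harmonic_level_db(signal, sample_rate_hz, fo_hz)
            - harmonic_level_db(signal, sample_rate_hz, 2 * fo_hz))

# Hypothetical test signal: second harmonic at one tenth of the
# fundamental's amplitude, i.e., 20 dB below it.
SAMPLE_RATE_HZ = 20_000
FO_HZ = 1600.0
signal = [math.sin(2 * math.pi * FO_HZ * i / SAMPLE_RATE_HZ)
          + 0.1 * math.sin(2 * math.pi * 2 * FO_HZ * i / SAMPLE_RATE_HZ)
          for i in range(2000)]

difference_db = h1_h2_db(signal, SAMPLE_RATE_HZ, FO_HZ)
```

For real recordings, fo would first have to be estimated and the analysis window chosen (or tapered) so that spectral leakage does not bias the two level estimates.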

Figure 3

Example of high-pitched phonation of S3. (A) Acoustic spectrogram; (B) three cycles of laryngeal oscillation at t ≈ 1.43 s, showing vocal fold contact area documented with electroglottography (EGG); (C) time-varying glottal area waveform (GAW), as documented by high-speed video (HSV) recording at 20,000 frames per second. The arrows indicate the still HSV frames shown in panel D; (D) HSV frames extracted at the instants indicated by the arrows in panels B and C. Note the full glottal closure in the third out of the five displayed video frames.

In contrast to the “default” laryngeal configuration in classical singing, which requires a moderately low vertical laryngeal position, phonation in the high-pitched soprano range was invariably facilitated by a raised larynx and moderate to extreme medialization of the ventricular folds (see Suppl. Fig. S5,S6). This suggests that the respective fo could only be achieved with pronounced larynx elevation and/or constriction. Overall, we found the following stereotypical glottal configurations at the highest examined pitches, which are documented in Fig. 4 and Suppl. Fig. S5:

  • Four participants (S1, S2, S7, and S8) phonated with a posterior glottal gap (denoted as glottal configuration I throughout the remainder of this manuscript), suggesting incomplete vocal fold adduction (see Fig. 4A for an example). They had completely separated vocal folds (along the entire visible anterior-posterior length) in the open phase and a partially closed glottis in the closed phase (i.e., the duration of the oscillatory cycle where the vocal folds are in contact, temporarily stopping or at least greatly reducing the laryngeal air flow), with vocal fold contact along 44 % to 75 % of the visible glottal length.

  • In contrast, the five other participants (S3, S4, S5, S6, and S9) phonated with full glottal closure (100 % vocal fold contact) in the closed phase, but with different configurations during the open phase. Three participants (S3, S5, and S6) had only a partial opening of the vocal folds, occurring along 40 % to 50 % of the visible vocal fold length (glottal configuration IIa)—see Fig. 4B for an example. The other two participants (S4 and S9) phonated with a fully opened visible glottis in the open phase (glottal configuration IIb).

Figure 4

Stereotypical glottal configurations in high-pitched operatic soprano singing. Two main strategies emerged: (A) glottal configuration I: phonation with incomplete vocal fold adduction, resulting in a posterior glottal gap during vibration (even in the “closed” phase); and (B) glottal configuration II: greatly increased adduction of the arytenoids, supported by medialization of the ventricular folds; (C) glottal opening profiles for all strategies—see supplementary materials S5 and S6 for details.

Respective documentation for all sopranos is provided in the supplementary materials (Suppl. Fig. S5,S6) and in Fig. 5. These data clearly corroborate the observation made in the HSV footage: The blue areas in the detail panels of Fig. 5 for S1 through S9 (A) are indicative of glottal closure and vocal fold collision, which occurred either partially (S1, S2, S7, and S8) or along the entire antero-posterior length of the visible glottis (S3, S4, S5, S6, and S9). Due to the observed vocal fold collision, the closed quotient (CQ), i.e., the relative duration of glottal closure over one vibratory cycle, was non-zero in most instances. Averaging all computed CQ values across the entire antero-posterior glottal length across all nine participants resulted in a median CQ value of 47.6%, with the 5th and 95th percentiles at 29.6% and 73.0%, respectively.
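The closed quotient and its summary statistics can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the closure threshold, the linear-interpolation percentile, and the example waveform (one cycle sampled at 13 frames, roughly what 20,000 fps yields at G6) are hypothetical.

```python
import statistics

def closed_quotient(glottal_opening, threshold=0.0):
    """Fraction of samples within one vibratory cycle in which the glottal
    opening (area or width at one antero-posterior position) is at or
    below the closure threshold."""
    closed = sum(1 for value in glottal_opening if value <= threshold)
    return closed / len(glottal_opening)

def percentile(values, p):
    """Percentile (p in 0..100) with linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

# Hypothetical cycle: six closed frames followed by seven open frames.
cycle = [0, 0, 0, 0, 0, 0, 1, 3, 5, 6, 5, 3, 1]
cq = closed_quotient(cycle)

# Hypothetical CQ values pooled over positions and participants.
pooled_cq = [0.30, 0.42, 0.46, 0.50, 0.71]
summary = (percentile(pooled_cq, 5),
           statistics.median(pooled_cq),
           percentile(pooled_cq, 95))
```

In practice the threshold would need to account for image noise and segmentation uncertainty, since a strict zero criterion is sensitive to single mis-segmented pixels.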

Figure 5

Vocal fold vibration analysis for phonation at pitch G6 (ca. 1568 Hz). (A) Glottovibrograms (GVG) for all investigated sopranos (S1 through S9); (B) individual and averaged glottal closed quotients along the antero-posterior axis (cf. Figure 4 in ref. 39).

We successfully reproduced the high-pitched soprano voice production with a finite difference model of vocal fold tissue vibration with string-like restoring forces. The two glottal configurations I and II were simulated with “weak” and “tight” vocal fold adduction, regulated via the pre-phonatory distance between the vocal processes of the arytenoid cartilages (with d = 0.6 mm and d = 0.1 mm for weak and tight adduction, respectively). Phonation with weak adduction resulted in a posterior glottal gap (Fig. 6A,B), all other parameters being equal across the two conditions. The emerging fo was 1540 Hz and 1597 Hz for weak and tight adduction, respectively. Results showed that string-like vocal tissue layers (mucosa and ligament), both with a fiber stress of 0.9 MPa, produced self-sustained vocal fold oscillation, again corroborating the MEAD production mechanism. This was the case for both weak and tight adduction scenarios. With weak adduction, the larger time-varying glottal area (Fig. 6C) caused larger airflow rates that were non-zero in the “closed” phase (Fig. 6D). This resulted in a reduced strength of the second harmonic in the frequency spectrum (Fig. 6F), as compared to phonation with strong adduction (Fig. 6E).
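To give a feeling for the orders of magnitude involved in “string-like” restoring forces, the sketch below evaluates the textbook ideal-string relation fo = sqrt(σ/ρ) / (2L). This is a strong simplification and not the finite difference model used in the study; the tissue density of 1040 kg/m³ and the function names are assumptions introduced here for illustration.

```python
import math

TISSUE_DENSITY_KG_M3 = 1040.0  # assumed value, not taken from the study

def string_fo_hz(length_m, stress_pa, density_kg_m3=TISSUE_DENSITY_KG_M3):
    """Fundamental frequency of an ideal string fixed at both ends."""
    return math.sqrt(stress_pa / density_kg_m3) / (2 * length_m)

def string_length_m(fo_hz, stress_pa, density_kg_m3=TISSUE_DENSITY_KG_M3):
    """Vibrating length an ideal string would need to reach fo_hz."""
    return math.sqrt(stress_pa / density_kg_m3) / (2 * fo_hz)

FIBER_STRESS_PA = 0.9e6  # fiber stress of 0.9 MPa, as stated in the text
length_for_g6_m = string_length_m(1568.0, FIBER_STRESS_PA)
print(f"Ideal-string length for G6: {length_for_g6_m * 1000:.2f} mm")
```

Under these assumptions the required vibrating length comes out at roughly 9 mm, which only indicates that a fiber stress of this magnitude is of the right order for the investigated fo; it says nothing about adduction, collision, or airflow, which the full model addresses.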

Figure 6

Computer simulation with a low-order finite difference model of vocal fold tissue vibration. (A) and (B) two pre-phonatory glottal configurations, resembling strong and weak adduction; (C) and (D) resulting glottal area and glottal airflow for high-pitched simulations with both pre-phonatory glottal configurations; (E) and (F) normalized spectra of glottal airflow resulting from strong and weak adduction.

Discussion

This study investigates high-pitched operatic phonation in the so-called whistle register above C6 using super high-speed laryngoscopy and computational modeling. Our data suggest that high-pitched soprano operatic singing is not produced by an aerodynamic whistle. Rather, we found medio-lateral vocal fold vibration synchronous with the variation of the radiated acoustic pressure in all investigated sopranos when phonating at fo ≈ 1.6 kHz. The observed medio-lateral vibratory component, which resulted in full glottal closure in five out of nine singers, is a fundamental requirement for voice production according to the MEAD principle15. This medio-lateral vibratory component, resulting in a cyclical variation of the glottal area at the observed fundamental frequencies, would be clearly detrimental to sound production with an aeroacoustic phenomenon. This is because a time-varying glottal area, causing time-varying airflow rates at the rate of the fundamental frequency, would introduce a considerable amount of frequency modulation (FM) into the putatively emerging aeroacoustic sound, thus violating the requirement to produce voice at quasi-stationary fo conditions in artistic singing. Furthermore, the acoustic signals contained a well-defined harmonic structure, with the second harmonic having a sound level that is only about 20 dB lower than that of the fundamental. This clearly contrasts with true aeroacoustic sound production like rodent ultrasonic vocalization40 or human lip whistling41, where the second harmonic’s level is 40 dB or more below that of the fundamental.

For these reasons, an aeroacoustic sound production phenomenon can be clearly ruled out for the investigated high-frequency operatic singing style. Consequently, and in agreement with previous studies5,6,8,28,32,33,34,35, the frequently used term “whistle register” (while potentially applicable to other types of ultra-high-pitched voice production) does not reflect the physiologic voice production mechanism for classical/operatic singing at these high frequencies. Admittedly, register names such as “head register” and “chest register” are commonly derived from perceptual factors; they are not necessarily scientifically appropriate but are nevertheless established. However, it is precisely when using the term “whistle register” for the pitch range analyzed in the present study that the underlying physiological principles should be correctly classified.

Female speech occurs at an average fo of approximately 200 Hz1. The lowest pitches of adult females are found at about 135 Hz1, and the highest pitches of operatic soprano singing, investigated here, occur at about 1.57 kHz (and ca. 2 kHz in one case, see Suppl. Fig. S2), thus covering a range of almost four octaves. This is in good agreement with the predicted fo range for different mammalian species at large42.
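The octave span quoted above follows from the base-2 logarithm of the frequency ratio. A minimal check, using only the frequencies given in the text (the function name is an illustrative choice):

```python
import math

def octaves_between(f_low_hz, f_high_hz):
    """Number of octaves separating two frequencies."""
    return math.log2(f_high_hz / f_low_hz)

span_to_g6 = octaves_between(135.0, 1568.0)  # lowest adult female pitch to G6
span_to_b6 = octaves_between(135.0, 1975.0)  # same, up to the single B6 case

print(f"135 Hz to G6: {span_to_g6:.2f} octaves")
print(f"135 Hz to B6: {span_to_b6:.2f} octaves")
```

Both values lie between three and four octaves, with the B6 case coming closest to the “almost four octaves” stated in the text.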

It is remarkable that the observed vibratory characteristics of the vocal folds documented here (recall Fig. 5B, showing median CQ values of 47.6%) closely resemble those reported for singing voice production in the M2 (“falsetto”) register found at relatively lower fo. For M2 phonation, Henrich et al.43 documented EGG contact quotients in the range of 5–50%, Herbst et al.44 reported videokymographic closed quotients in the range of zero to about 50%, and Echternach et al.45 reported high speed videolaryngoscopically derived closed quotients of the glottal area from 0 to 50%.

Furthermore, during the review process of the present manuscript, Kato et al. published a comparable study in non-professional singers, analyzing pitches from C6 to A6, albeit using rigid transoral laryngoscopy during high-speed recordings46. Although rigid transoral endoscopy might have affected vocal fold tension and vocal tract/voice source interactions, these authors also documented vocal fold oscillations in all six of their subjects.

Considering these findings, we conclude that the laryngeal vibratory phenomena of high-pitched operatic soprano singing are comparable to what is seen in the M2 mechanism. The auditory perceptual distinction of the investigated type of voice production (commonly termed the M3 mechanism) from the lower-pitched M2 vocal register is likely caused by influences of the vocal tract, as suggested by previous research47,48,49,50.

Types of vocal fold closure

Five of the nine participants phonated with a fully adducted posterior glottis, and in three of these the vocal folds were partially in contact along the antero-posterior mid-line at the moment of maximum glottal opening. This might be indicative of a “damping” phenomenon in analogy to violin playing—shortening the vibrating portion of a violin string with finger pressure—as previously proposed by some authors51,52,53,54. In such a “damping” mechanism, control would be facilitated by adjustment (and, specifically, shortening) of the vocal fold portion that is in vibration, brought about by high degrees of vocal fold adduction and arytenoid compression. It is, however, unlikely that a medial (adductory) pressure can establish a fixed boundary without excessively constricting the entire glottis. Further, the “pinning” force would be in the wrong direction. It is not perpendicular to the vibration, as in pinning a violin string, but rather in the direction of motion. This would establish a “fuzzy” boundary point, unlikely to be controllable over a wide pitch range.

The damping concept was also not supported by our computational model. For a “damping” phenomenon to occur, the physical boundaries of vocal fold vibration would have to be varied in antero-posterior direction, which was not observed in vivo and could also not be reproduced in silico. Furthermore, contraction of the thyroarytenoid (TA) muscle, which may normally contribute to medial compression during voice production in M1 and M2 (ref. 2), does not have much effect when counteracted by extremely high ligament stiffness. We therefore propose the following alternative hypothesis: MEAD phonation at the investigated range requires unusually high activity in the cricothyroid (CT) muscle in order to achieve the ligament stiffness that is required for the targeted fo. However, all else being equal, such a maneuver may lead to abduction of the posterior glottis55, which is detrimental to voice source strength (recall Fig. 6E). Consequently, singers try to counteract this tendency to vocal fold abduction by increasing glottal closure through substantial degrees of laryngeal medialization (i.e., adduction of arytenoids, vocal folds, and ventricular folds; recall Suppl. Fig. S5,S6), in order to maximize the achievable degrees of glottal closure and thus increase the amplitudes of the higher harmonics of the voice source. The observed glottal configurations (I, IIa, and IIb) were thus most likely the result of individual anatomical and physiological predisposition.

Findings in relation to other types of high-pitched singing

In summary, our investigation was concerned with high-pitched operatic soprano singing. However, ultra-high-pitched singing of both female and male singers is also found in other singing styles. Specifically, some contemporary commercial music (CCM) singers regularly phonate in the fo range of 2–3 kHz47. Furthermore, performers singing in the “extranormal voice” style have been reported to phonate at fo of up to 6 kHz31, and it was speculated that this “M4” phonation would be produced with a “vortex whistle”56. Notably in this context, Tsai et al.26 suggested a diffuser jet with periodic vorticity bursts in the larynx for phonation at fo ≈ 4 kHz. That respective phonation was investigated with ultrasound Doppler imaging, revealing a vocal fold vibratory amplitude of 0.1 mm, i.e., only barely visible to the naked eye. This is in agreement with Di Corcia et al.28, who reported that their participants’ “stop closure whistle” was produced during total absence of a mucosal wave and thus vocal fold vibration. These findings suggest that further investigation of the ultra-high-pitched phonations of humans is required. While our data and findings conclusively show that high-pitched operatic soprano singing is produced according to the MEAD principle, we cannot altogether rule out an aeroacoustic production mechanism for other human singing styles, for example CCM, contemporary experimental music, non-professional classical singing, or folk styles, at higher fundamental frequencies. In this respect, in contrast to western classically trained singing, amplification techniques are commonly used for high-pitched singing in CCM. If an aeroacoustic mechanism were present, the radiated sound could be expected to be rather weak. However, this could be counteracted by amplification.
If an aeroacoustic production mechanism could be empirically documented in non-classical high-pitched singing or any other human voice production, it would be of interest to explore in future experiments whether it is possible to drive the vocal folds as a passive, coupled resonator with such a sound generating mechanism.

Limitations

The transnasal endoscopic data acquisition might have caused some irritation for the participants and it cannot be excluded that it led to tightening reflexes or increased muscle tension. However, none of the participants reported such irritation and none of them canceled the ongoing procedure. The advantage over exclusively electroglottographic measurements is the detailed observation of the vocal folds’ configuration and movement. In addition, for very high pitches, the larynx is often raised and the vocal tract tube is narrowed. This might cause slipping of the electrodes and corruption of the signal. It could be of interest how singers estimate their voice production at such high pitches, i.e. if they use AA or MEAD and if there is a full closure for the MEAD. It is a limitation of the present study that no systematic evaluation of the participants’ proprioceptive feedback concerning register transitions and phonation mechanism was performed. Such systematization of proprioception linked to actual physiology could, however, be a subject of future research.

Finally, the acquired voice signals contained aliasing artifacts above 4 kHz. However, given that the investigated fundamental frequencies were typically well below 2000 Hz, it was safe to compute both the fo and H1-H2 metrics of those signals (see Methods for an in-depth discussion).

Materials and methods

All methods were carried out in accordance with relevant guidelines and regulations. Furthermore, all experimental protocols were approved by the Freiburg University Ethical Committee (Nr 380/12). Informed consent was obtained from all subjects.

Participants and phonatory task

Following approval from the local ethical committee, nine female professional operatic soprano singers were investigated. All participants had at least 4 years of professional training in western classical singing. Limited biographic data and a taxonomic classification according to the scheme proposed by Bunch and Chapman57 are provided in Table 1. At the time of the experiment none of the participants complained of any vocal symptoms, and vocal pathologies were excluded based on videostroboscopy and/or high-speed digital imaging evidence.

Table 1 Overview of participants, indicating their proficiency level (taxonomy according to Bunch and Chapman57), investigated phonatory fo, and observed glottal configuration.

The participants were asked to sing an ascending major scale on the vowel [a:] from musical pitch C6 (ca. 1047 Hz) to G6 (ca. 1568 Hz), avoiding extensive vibrato. Each note was to last approximately one second. As pointed out in the introduction, there is no consensus at which musical pitch the so-called whistle register would start. However, there seems to be a general agreement that the musical pitch G6 falls within the whistle register. This motivated the choice of the pitch range in the present study. The vowel [a:] was chosen because it could be expected that above fo of 700–800 Hz classical singers will only produce this vowel quality: When fo reaches the center frequency of the lowest vocal tract resonance (fR1), singers tend to avoid a crossing of fo and fR1, thus raising fR1 as fo further increases50,58. All except two participants (S6 and S9) successfully accomplished the phonatory task, while S6 and S9 could only reach the musical pitch F6 (fo ≈ 1397 Hz).
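The pitch-to-frequency values quoted for the phonatory task can be reproduced with the standard equal-temperament relation. The sketch below assumes twelve-tone equal temperament with A4 = 440 Hz, which the text does not state explicitly; the names `NOTE_OFFSETS` and `pitch_to_hz` are illustrative.

```python
# Semitone offsets of the note names within one octave (C = 0).
NOTE_OFFSETS = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def pitch_to_hz(name, octave, a4_hz=440.0):
    """Frequency of a pitch in scientific pitch notation, assuming
    twelve-tone equal temperament (MIDI note 69 corresponds to A4)."""
    midi_note = 12 * (octave + 1) + NOTE_OFFSETS[name]
    return a4_hz * 2 ** ((midi_note - 69) / 12)

# Ascending C major scale segment from C6 to G6, as in the phonatory task.
for name in ["C", "D", "E", "F", "G"]:
    print(f"{name}6: {pitch_to_hz(name, 6):.1f} Hz")
```

With these assumptions, C6, F6, and G6 round to 1047 Hz, 1397 Hz, and 1568 Hz, matching the values given in the text.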

Data acquisition

Analogous to previous investigations45,59,60,61, High Speed Digital Videolaryngoscopy (HSV) was performed using trans-nasal endoscopy. A flexible endoscope (ENF GP, Olympus, Hamburg, Germany) was mounted on a 38 mm C-mount adapter (Karl Storz, Tuttlingen, Germany) and connected to a Photron high-speed camera (Fastcam SA-X2, Photron, Tokyo, Japan) and a 300 W light source (Karl Storz, Tuttlingen, Germany). The camera was operated at a frame rate of 20,000 frames per second (fps) and a spatial resolution of 386 × 320 pixels.
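The relevance of the frame rate can be illustrated by counting how many video frames fall into one vibratory cycle. The sketch below is a simple back-of-the-envelope calculation; the 4000 fps comparison value is a hypothetical lower frame rate chosen for illustration and does not refer to any specific earlier study.

```python
def frames_per_cycle(frame_rate_fps, fo_hz):
    """Average number of video frames captured per vibratory cycle."""
    return frame_rate_fps / fo_hz

def minimum_frame_rate_fps(fo_hz, frames_needed_per_cycle=2.0):
    """Frame rate required for a given number of frames per cycle;
    two frames per cycle corresponds to the Nyquist limit."""
    return fo_hz * frames_needed_per_cycle

FO_G6_HZ = 1568.0
at_20k = frames_per_cycle(20_000, FO_G6_HZ)  # frame rate used in the study
at_4k = frames_per_cycle(4_000, FO_G6_HZ)    # hypothetical comparison

print(f"20,000 fps: {at_20k:.1f} frames per cycle at G6")
print(f" 4,000 fps: {at_4k:.1f} frames per cycle at G6")
```

Roughly a dozen frames per cycle are available at G6 with 20,000 fps, whereas a rate close to the Nyquist limit leaves only two to three frames, which is too few to resolve opening, maximum excursion, and closure within a single cycle.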

Simultaneously with the HSV recordings, time-synchronous acoustic and electroglottographic (EGG) signals were captured with a National Instruments (Austin, Texas, USA) DAC USB-6251 interface at a sampling rate of 20,000 Hz. The DAC was automatically triggered (i.e., switched on and off) via a TTL signal emitted by the HSV camera. The acoustic signal was captured with an omnidirectional microphone (Behringer ECM-8000, Behringer, or Sennheiser ME 62, Sennheiser, Wedemark, Germany) and a Mackie 802VLZ4 (Bothell WA, USA) preamplifier. The EGG signal was acquired using a dual-channel EGG device (EG2-PCX2, Glottal Enterprises, Syracuse, NY, USA). Due to the lack of a hardware low-pass filter and because of the relatively low sampling frequency of 20,000 Hz (with a Nyquist frequency of 10,000 Hz), the acoustic signals contained mild traces of aliasing artifacts. Spectrographic inspection of these acoustic signals using a spectrogram dynamic range of 90 dB revealed that the aliasing artifacts were on average above 5267 Hz, with a standard deviation of about 900 Hz. This suggests that—while the spectral acoustic data needs to be considered with care—it was safe to utilize the acoustic signals for both fundamental frequency estimation and computation of the intensity relation of the first to the second harmonic found in the signals. That latter approach is justified because even for the highest investigated musical pitch (G6) at about 1568 Hz, the second harmonic was well below the lowest frequency where aliasing artifacts were observed in the acoustic signals.
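The aliasing argument can be made concrete by computing where harmonics above the Nyquist frequency fold back into the recorded band. This is a sketch of standard frequency folding, not the authors' spectrographic procedure; the cutoff at the tenth harmonic is an arbitrary assumption about how many harmonics carry noticeable energy.

```python
def folded_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a component sampled without an
    anti-aliasing filter."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

def aliased_harmonics(fo_hz, sample_rate_hz, max_harmonic):
    """Map harmonic number to apparent frequency for all harmonics
    that lie above the Nyquist frequency."""
    nyquist_hz = sample_rate_hz / 2
    return {k: folded_frequency(k * fo_hz, sample_rate_hz)
            for k in range(1, max_harmonic + 1)
            if k * fo_hz > nyquist_hz}

SAMPLE_RATE_HZ = 20_000
FO_G6_HZ = 1568
aliases = aliased_harmonics(FO_G6_HZ, SAMPLE_RATE_HZ, max_harmonic=10)

for harmonic, apparent_hz in aliases.items():
    print(f"H{harmonic}: {harmonic * FO_G6_HZ} Hz appears at {apparent_hz} Hz")

second_harmonic_hz = 2 * FO_G6_HZ
lowest_alias_hz = min(aliases.values())
```

Under this assumption, harmonics 7 through 10 of G6 fold back to between about 4.3 kHz and 9.0 kHz, all above the second harmonic at 3136 Hz. Still higher harmonics would fold to lower frequencies, so the argument relies on those harmonics being weak.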

In order to verify the accuracy of the signals’ synchronization, a custom-made rotating disk with a printed black and white pattern was synchronously recorded with both HSV and a simple electric circuit containing a photodiode that monitored the rotating disk’s light intensity. The output of the photodiode current was routed to the acoustic channel of the DAC. The digitally computed light intensity of the respective HSV recording was compared with the photodiode current variation seen in the DAC’s input channel, and a perfect temporal agreement was found.
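One common way to quantify the temporal agreement between two such signals is to search for the lag that maximizes their cross-correlation. The text does not specify how the comparison was carried out, so the following is only a sketch of that general technique; the square-wave test pattern and the three-sample delay are hypothetical.

```python
def best_lag(reference, other, max_lag):
    """Lag in samples (positive: `other` is delayed) that maximizes the
    cross-correlation between two equally sampled signals."""
    def correlation(lag):
        total = 0.0
        for i, ref_value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                total += ref_value * other[j]
        return total
    return max(range(-max_lag, max_lag + 1), key=correlation)

# Hypothetical black/white disk pattern: zero-mean square wave with a
# period of 20 samples, standing in for the video-derived light intensity.
video_intensity = [1 if (i // 10) % 2 == 0 else -1 for i in range(200)]

# Hypothetical photodiode signal delayed by three samples.
photodiode = [-1] * 3 + video_intensity[:-3]

lag_aligned = best_lag(video_intensity, video_intensity, max_lag=5)
lag_delayed = best_lag(video_intensity, photodiode, max_lag=5)
```

Because the test pattern is periodic, the search range must stay below half a period; otherwise a shift by a full period would be indistinguishable from perfect alignment.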

HSV pre-processing and data analysis

The segmentation of the visible glottis and the medio-lateral vocal fold deflections—as documented by HSV—required several pre-processing steps that were accomplished through scripts implemented in the Matlab framework (R2014b, MathWorks Inc., Natick, MA, USA)45,59,60,61:

(a) In some cases, a honeycomb structure introduced by the endoscope optics was visible in the HSV recordings. This artifact was removed with a frequency-selective FFT-filter by transforming the images into the frequency domain via a 2-D discrete Fourier transform. Therein, the periodic noise appeared as 2 major peaks, apart from the center frequency peak. These two peaks were identified via an adaptable threshold binarization, and the areas were slightly increased via an opening filter and set to 0. The images were then transformed back into the image domain.

(b) Because the angle of the glottis could change with respect to the orientation of the HSV field of view during the recordings, the HSV footage had to be spatially rotated in order to align the glottis with the vertical dimension in the recordings. To this end, an approximate mask of the glottal opening was found in every image by means of time-difference images. In these ellipsoid-shaped masks, the orientation of the main axis of the glottis-opening could be calculated. Here, we assumed that the glottis shows the most movement between discrete images in the video. Hence, we calculated difference images from pictures with an adaptable time offset (typically set to 5 frames). Via various image processing methods, such as binarization, opening filter, and search for connected components, the largest object with the greatest movement was identified. The major axes of the resulting elliptical object were then determined, and the whole image could then be rotated so that the glottis was in vertical alignment. In most cases, some manual adjustments of the resulting angle graphs were necessary in order to remove sudden and improbable changes.

(c) In a last pre-processing step, a bounding box was manually drawn around the glottis at key frames, and all images of a sequence were correspondingly cropped. As a result, the glottis now appeared both vertically and horizontally centered in every frame of the HSV sequence.
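The frequency-selective filtering of step (a) can be illustrated with a short Python sketch. It is not the Matlab code used in this study. The function name, the parameter values, and the synthetic test image are assumptions, and a simple one-pixel dilation stands in for the step that enlarges the detected peak areas.

```python
import numpy as np

def remove_periodic_noise(img, keep_radius=3, rel_threshold=0.5, grow=1):
    """Suppress periodic image noise by zeroing strong off-center spectral peaks."""
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    magnitude = np.abs(spectrum)
    rows, cols = img.shape
    yy, xx = np.ogrid[:rows, :cols]
    # Protect the low-frequency region around the center (DC) peak.
    off_center = (yy - rows // 2) ** 2 + (xx - cols // 2) ** 2 > keep_radius ** 2
    # Threshold binarization relative to the strongest off-center component.
    threshold = rel_threshold * magnitude[off_center].max()
    mask = (magnitude >= threshold) & off_center
    # Slightly enlarge the detected peak areas (4-neighbor dilation).
    for _ in range(grow):
        p = np.pad(mask, 1)
        mask = p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]
    spectrum[mask] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

# Synthetic example: a smooth blob plus a regular pattern (assumed test data).
n = 64
gy, gx = np.mgrid[:n, :n]
blob = np.exp(-((gy - 32) ** 2 + (gx - 32) ** 2) / (2 * 6.0 ** 2))
pattern = np.cos(2 * np.pi * (8 * gy + 12 * gx) / n)
cleaned = remove_periodic_noise(blob + pattern)
print(np.max(np.abs(cleaned - blob)))  # residual error after filtering
```

In real HSV frames the periodic peaks are less sharply defined than in this synthetic case, which is why an adaptable threshold was needed.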

Glottis segmentation—and thus determination of the time-varying medio-lateral vocal fold displacement along the antero-posterior glottal dimension—was performed using the custom-made Glottis Analysis Tools (Denis Dubrovskiy and Michael Döllinger, Erlangen University, Germany), as described previously62.

For further analysis, the glottal area waveform (GAW) was computed, giving the time-varying visible area of the glottis in pixels. The glottal segmentation data was also used to compute phonovibrograms63 (PVG), i.e., a visualization procedure that extracts vocal fold vibrations from HSV data and transfers the motion information into a set of displacement data for both the left and the right vocal fold63. The PVG information pertaining to the individual vocal folds was then combined into a glottovibrogram64 (GVG), representing the time-varying glottal width in pixels along the antero-posterior glottal dimension. The GVG data was then used to compute the glottal closed quotients (i.e., the relative duration of vocal fold collision per vibratory cycle, expressed in percent) along the antero-posterior glottal axis. These last two processing steps, as well as the generation and assembly of all figures in this manuscript, were achieved with scripts written in the Python programming language.
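As an illustration of the quantities described above, the following Python sketch derives an area waveform and a position-wise closed quotient from a GVG matrix. It is a simplified stand-in for the authors' scripts. It assumes a GVG array of shape (antero-posterior positions × frames) that spans a whole number of vibratory cycles, treats a width of zero pixels as closure, and approximates the GAW as the sum of glottal widths per frame. The function names are hypothetical.

```python
import numpy as np

def gaw_from_gvg(gvg):
    """Approximate glottal area per frame (pixels) as the sum of widths along the AP axis."""
    return np.asarray(gvg).sum(axis=0)

def closed_quotient_percent(gvg):
    """Percentage of frames with zero glottal width, per antero-posterior position."""
    return 100.0 * np.mean(np.asarray(gvg) <= 0, axis=1)

# Synthetic GVG: 3 positions, 2 cycles of 10 frames each (assumed data).
cycle = [0, 0, 0, 0, 1, 2, 3, 3, 2, 1]
gvg = np.array([
    [0] * 20,      # position that never opens
    cycle * 2,     # position closed for 4 of 10 frames per cycle
    [1] * 20,      # position that never closes
])
print(closed_quotient_percent(gvg))
print(gaw_from_gvg(gvg)[:10])
```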

Analysis of voice signals

After recording and segmentation of the glottis from the HSV material, all voice signals (GAW, EGG signal, and audio signal) were analyzed for fo using an auto-correlation method within the custom-made MultiSignalAnalyzer software65. In order to avoid irregularities occurring during possible fo transitions, only the stable part of each phonation was analyzed, using a time window of 100 ms centered on the temporal midpoint of each phonatory (musical) pitch (midpoint ± 50 ms). The relative sound level difference between the first and the second harmonic (H1–H2, expressed in decibels (dB)66) was computed for the lowest and highest phonations, attempted at musical pitches C6 and G6, respectively. Because of some aliasing phenomena in the audio signals, calibration of the sound pressure level was considered problematic.
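The two measures described above can be sketched in Python as follows. This is not the MultiSignalAnalyzer implementation. The search range for fo, the parabolic peak interpolation, the Hann window, the zero-padded FFT length, and all function names are assumptions made for illustration.

```python
import numpy as np

def estimate_fo_autocorr(x, fs, fo_min=200.0, fo_max=2000.0):
    """Estimate fo (Hz) from the highest autocorrelation peak within a lag range."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0 .. N-1
    lag_min = int(fs / fo_max)
    lag_max = int(fs / fo_min)
    k = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    # Parabolic interpolation refines the peak position below one sample.
    a, b, c = ac[k - 1], ac[k], ac[k + 1]
    denom = a - 2.0 * b + c
    shift = 0.5 * (a - c) / denom if denom != 0 else 0.0
    return fs / (k + shift)

def h1_h2_db(x, fs, fo, n_fft=1 << 16):
    """Level difference (dB) between the spectral peaks near fo and 2*fo."""
    x = np.asarray(x, dtype=float)
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)), n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    def peak(center):
        band = (freqs > 0.9 * center) & (freqs < 1.1 * center)
        return spec[band].max()

    return 20.0 * np.log10(peak(fo) / peak(2.0 * fo))

# Synthetic 100 ms window at the sampling rate of the recordings (assumed signal).
fs = 20_000
t = np.arange(int(0.1 * fs)) / fs
signal = np.sin(2 * np.pi * 1046.5 * t) + 0.1 * np.sin(2 * np.pi * 2093.0 * t)
fo_est = estimate_fo_autocorr(signal, fs)
print(fo_est, h1_h2_db(signal, fs, fo_est))
```

At a sampling rate of 20,000 Hz, one period of a tone near 1500 Hz spans only about 13 samples, so some form of sub-sample refinement is needed for an autocorrelation estimate at these pitches.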

Aeroacoustic model

For the aeroacoustic model used for computing the data shown in Fig. 2, we applied the following reasoning. When, in a hypothetical aeroacoustic sound source without vocal fold vibration, the glottal air flow separates from the glottis and a jet is formed, small instabilities in this glottal air jet can become entrained at certain frequencies due to a feedback loop between these downstream-traveling flow structures and acoustic waves traveling upstream20. For an impinging jet model, the emerging frequencies have been estimated with a previously described model20 given by f_n = n · u/x_wall, where f_n is the frequency of the nth possible whistle mode, n is the mode number, x_wall is the jet length, and u is the mean convection speed of downstream-moving coherent structures (i.e., the air flow speed). The mean convection speed u is approximated by u = V/A_gl, where V is the volumetric air flow rate and A_gl is the glottal constriction area. For the lowest possible frequency (mode 1), the model reduces to f_mode-1 = u/x_wall.
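The model can be written as a few lines of Python. The sketch below only restates the two relations given above. The function names are hypothetical, and the numerical values in the example are arbitrary illustrations, not data from this study.

```python
def convection_speed(flow_m3_per_s, glottal_area_m2):
    """Mean convection speed u = V / A_gl in m/s."""
    return flow_m3_per_s / glottal_area_m2

def whistle_frequency(n, flow_m3_per_s, glottal_area_m2, x_wall_m):
    """Frequency of the nth whistle mode, f_n = n * u / x_wall, in Hz."""
    u = convection_speed(flow_m3_per_s, glottal_area_m2)
    return n * u / x_wall_m

# Arbitrary example values (assumptions): 0.2 l/s flow, 2 mm^2 area, 2 cm jet length.
flow = 0.2e-3      # m^3/s
area = 2.0e-6      # m^2
x_wall = 0.02      # m
for mode in (1, 2, 3):
    print(mode, whistle_frequency(mode, flow, area, x_wall))
```

Because f_n scales linearly with flow rate and inversely with both glottal area and jet length, the predicted whistle frequency changes whenever any of these quantities changes.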

Computer simulation

A finite difference model of vocal fold tissue vibration was used to generate a high fo with string-like restoring forces. However, a single string is insufficient to produce self-sustained oscillation with air passing over its surface. It requires multiple coupled strings with ribbon-like flexing of the medial surface of the vocal folds. Alternating convergent and divergent glottal shapes can then produce an aerodynamic push–pull on the vocal folds for sustained oscillation67. The vocal folds also have tissue depth in the medial–lateral direction that allows edge movement but restrains movement into the deep muscular layer. A rectangular parallelepiped with 90 coupled masses was sufficient to meet the boundary conditions and the tissue properties. The vocal fold length was 0.945 cm, the thickness 0.3 cm, and the depth 0.45 cm. In the anterior–posterior direction, 5 masses allowed string-like motion with fixed boundary conditions at both ends. Along the vocal fold thickness, 3 masses provided the ribbon-like flexure and bi-stable nature of voice registration68, and 6 masses were used laterally to allow vibration to dissipate to zero. The masses were coupled with fiber stresses of 0.9 MPa along the vocal fold length in the first two medial–lateral layers that represented the mucosa and ligament, respectively. In medial–lateral layers 3 to 6, a muscle fiber stress of 5 kPa was selected, a value in the mid-range of measured thyroarytenoid muscle stress69. For shear coupling between the fibers, a gel shear modulus of 1.0 kPa was chosen according to measurement70. The damping ratio for the vibrating tissue needed to be 0.04, lower than the 0.1 value typically chosen for speech-like fundamental frequencies71. With this low damping ratio, a lung pressure of 4 kPa was required to obtain self-sustained oscillation. For the 90 masses, 180 first-order differential equations were solved with a fourth-order Runge–Kutta solver14. With a simple string formula based on ligament stress and vocal fold length, the natural frequency of oscillation was predicted to be 1587 Hz.
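The string formula is not spelled out in the text. A minimal sketch, assuming the ideal-string relation f = (1/(2L)) · sqrt(σ/ρ) and a tissue density of 1000 kg/m³ (both assumptions, as is the function name), reproduces the stated value and the mass count of the model:

```python
import math

def string_natural_frequency(stress_pa, length_m, density_kg_m3=1000.0):
    """Fundamental frequency of an ideal string fixed at both ends, in Hz."""
    wave_speed = math.sqrt(stress_pa / density_kg_m3)   # m/s
    return wave_speed / (2.0 * length_m)

ligament_stress = 0.9e6       # Pa, fiber stress from the text
vocal_fold_length = 0.945e-2  # m, length from the text

f_string = string_natural_frequency(ligament_stress, vocal_fold_length)
print(round(f_string))        # about 1587 Hz under the assumed density

# Mass grid from the text: 5 (length) x 3 (thickness) x 6 (depth).
n_masses = 5 * 3 * 6
n_equations = 2 * n_masses    # consistent with two first-order states per mass
print(n_masses, n_equations)
```

The count of 180 first-order equations for 90 masses is consistent with one displacement and one velocity state per mass, although the text does not state this explicitly.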

Vocal fold adduction was controlled with one variable, the superior-posterior glottal width, which is the distance between the vocal processes of the arytenoid cartilages (i.e., the posterior cartilaginous boundary of the membranous vocal fold portion). This width was chosen to be 0.1 mm for tight adduction and 0.6 mm for weak adduction. The width varied linearly to zero at the anterior commissure. However, the glottal width did not vary vertically along the thickness of the vocal folds. In other words, the pre-phonatory glottis was neither convergent nor divergent, but rectangular.
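The pre-phonatory glottal shape described above can be sketched as a linear width profile. The function name and the choice of sampling positions are assumptions. The vocal fold length and the two posterior widths are taken from the text.

```python
import numpy as np

def prephonatory_width_mm(y_cm, posterior_width_mm, length_cm=0.945):
    """Glottal width (mm) at distance y_cm from the anterior commissure.

    The width rises linearly from zero at the anterior commissure to
    posterior_width_mm at the vocal processes. It does not depend on the
    vertical position, so the glottis is rectangular in the coronal plane.
    """
    y = np.asarray(y_cm, dtype=float)
    return posterior_width_mm * y / length_cm

# Five equally spaced positions along the fold (illustrative sampling only).
y_positions = np.linspace(0.0, 0.945, 5)
print(prephonatory_width_mm(y_positions, 0.1))   # tight adduction
print(prephonatory_width_mm(y_positions, 0.6))   # weak adduction
```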

The aeroacoustic solution was obtained for an [a:] vowel with a simplified Navier–Stokes approach, as described previously68. The airway geometry, from the trachea to the lips, was taken from MRI data obtained by Story et al.72.

Ethical votum

Freiburg University 380/12.