1 Introduction

Filter banks are the driving force behind speech processing applications and are constructed over the frequency spectrum of a signal. Triangular, trapezoidal, and Gaussian shapes with different numbers of frequency bands have been suggested for constructing filter banks in speech processing tasks. Mel filters [111] and their cosine-transform-reduced Mel frequency cepstral coefficients (MFCCs) [4, 16, 27] are still prominent and remain the most commonly used features in speech recognition. The Mel scale was introduced by Stevens, Volkmann, and Newman as a perceptual scale of pitches judged by listeners to be equally spaced from one another. The human ear is an extremely complicated frequency spectrum analyzer. Although frequency bands are important clues for capturing and understanding speech, the human brain does not rely solely on these features. Human understanding of speech and pattern matching leverages many different aspects, such as back-end ambiance, contextual features, linguistic structures, and language models complemented by front-end frequency and spatial features, to perform a lightning-fast evaluation that utilizes the immense parallel processing power of the 100-billion-neuron pattern recognition network of our brain [48, 81].
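As a small illustration of this perceptual warping, the sketch below converts frequencies to the Mel scale using the widely used 2595·log10(1 + f/700) formula; this particular formula is one common convention and is not taken from the cited works.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the Mel scale (a common formulation)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping from Mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Example: the warping compresses high frequencies relative to low ones.
for f in (100, 500, 1000, 2000, 4000, 8000):
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")
```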

Human auditory and speech production systems are extremely complicated structures. Myriad theories have been proposed to model the human auditory system, among them Mel filter banks, Gammatone filter banks [57], Lyon auditory filters [73], Ensemble Interval Histograms (EIH) [40], Seneff auditory filters [107], PLP (Perceptual Linear Prediction) [49], and trapezoidal models [2, 87, 88]. On the other hand, the human vocal tract, which produces the sound signal, is another fascinating and intricate mechanism. The vocal folds create a quasi-periodic source signal by vibrating at a rate called the fundamental frequency or pitch; other parts of the mouth, such as the teeth, tongue, lips, jaw, and even the nose, form a filtering mechanism that sculpts the source signal to produce various phones with different resonant frequencies, which are called formants.

The ERB (Equivalent Rectangular Bandwidth) gammatone model typically extracts 32 critical-band filters. The ERB measures the width of the auditory bands along the human cochlea and follows a nearly logarithmic scale. Critical bands are also measured in the psychoacoustic experiments underlying gammatone filter banks. A critical band represents the portion of the speech signal carried by a single auditory nerve unit.
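The ERB bandwidth mentioned above is commonly approximated with the Glasberg–Moore formula; the following sketch computes ERB bandwidths and 32 ERB-spaced center frequencies. The 100 Hz–8 kHz range is an illustrative assumption, not a value taken from the text.

```python
import numpy as np

def erb_bandwidth(f_hz):
    """Equivalent Rectangular Bandwidth (Hz) at center frequency f_hz (Glasberg & Moore, 1990)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_scale(f_hz):
    """ERB-rate (number of ERBs below f_hz); approximately logarithmic in f."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

# Example: 32 center frequencies equally spaced on the ERB-rate scale, 100 Hz to 8 kHz.
n_filters = 32
lo, hi = erb_scale(100.0), erb_scale(8000.0)
centers_erb = np.linspace(lo, hi, n_filters)
centers_hz = (10.0 ** (centers_erb / 21.4) - 1.0) * 1000.0 / 4.37
print(np.round(centers_hz[:5]), np.round(erb_bandwidth(centers_hz[:5])))
```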

Richard Lyon proposed a cochlear model that describes the human cochlea as a nonlinear filter bank. The cochlear base is stiffer than its apex and is sensitive to high frequencies; this sensitivity decreases from the base to the apex. The first stage of the Lyon cochlear model mimics the outer and middle ear and behaves like pre-emphasis. In the second stage, half-wave rectification (HWR) eliminates the negative parts of the input signal, as the inner hair cells of the ear do. In the last stage, a cochleagram representing a time–frequency space is formed. Short-time autocorrelation (STA) is then applied to the outputs of the nonstationary speech frames to build a correlogram, a 3D time–frequency–lag space built on the cochleagram representation. For 16 kHz sound signals, Lyon's cochlear model uses 86 filters.

Seneff's model comprises 40 filters representing the motion of the basilar membrane and the auditory nerve response. Mean rates and synchrony outputs are the two outputs of the Seneff cochlear model. The mean rates are derived from the envelope of the stage output and can be considered spectral magnitude representations, whereas the synchrony outputs aim to discover the center frequencies of consonants (stops, fricatives, and sonorants).

The EIH (Ensemble Interval Histogram) is another hearing model suggested by Ghitza in 1992. The EIH model is quite similar to the Seneff model in the beginning stage; however, it constructs 85 filters compared to the 40 filters used in the Seneff model. The second part of the EIH generates histograms per channel. The final part aggregates the histograms of the second stage, known as the Ensemble Interval Histogram.

The PLP (Perceptual Linear Prediction) model was proposed by Hermansky and comprises 24 filters based on the Bark scale, proposed by Zwicker as a psychoacoustic hearing model. Perceptual linear predictive coding leverages the cubic-root intensity–loudness power law and equal-loudness curves to flatten the spectral magnitudes of the critical bands. PLP also uses an all-pole autoregression model to simulate the human vocal tract and provide a clear representation of the auditory spectrum, which allows PLP to simulate human hearing better than LPC [12]. PLP is less sensitive to noise and computationally more efficient than linear predictive coding. RASTA-PLP (RelAtive SpecTrAl PLP) [50] was introduced to enhance the robustness of PLP to the distortions of communication transmission channels.

The Mel filter bank is the most common filter bank used in speech-processing applications. Mel filter banks typically comprise 40 triangular frequency bands that overlap with one another. However, there is no firm agreement among researchers on the frequency regions of these bands; the regions of the triangular bands may be chosen according to the application (music, emotion, speech, speaker, gender, etc.). These triangular filter banks try to mimic the human auditory system. Their cousins, Mel-frequency cepstral coefficients (MFCCs), are computed by applying the DCT (Discrete Cosine Transform) to the logarithmic magnitude spectrum of the 40 Mel filters. Davis, Mermelstein, and other researchers note that the MFCC can be regarded as an approximation of principal component analysis (PCA) applied to the logarithmic power spectrum. While the MFCC is very successful in speech recognition implementations, it is poorly suited to speech synthesis because the truncation of coefficients and the loss of phase information make the transformation practically irreversible. It is also highly susceptible to noise.
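To make the relationship between Mel filter banks and MFCCs concrete, here is a minimal sketch that builds triangular Mel filters and takes the DCT of the log filter bank energies; the filter parameters (512-point FFT, 100 Hz–8 kHz range) are illustrative assumptions rather than the exact configuration used later in this paper.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters=40, n_fft=512, sr=16000, fmin=100.0, fmax=8000.0):
    """Triangular Mel filters on the FFT bin grid (illustrative parameters)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_from_power_spectrum(power_frames, n_ceps=13):
    """MFCC = DCT-II of the log Mel filter bank energies, keeping the first n_ceps coefficients."""
    fb = mel_filterbank(n_fft=2 * (power_frames.shape[1] - 1))
    mel_energies = np.maximum(power_frames @ fb.T, 1e-10)
    return dct(np.log(mel_energies), type=2, axis=1, norm='ortho')[:, :n_ceps]
```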

Studies of the imitation of the human vocal tract date back to the late eighteenth century, when Christian Kratzenstein explained the differences between /a/, /e/, /i/, /o/, and /u/. The quest continued with Wolfgang von Kempelen, Charles Wheatstone, Alexander Graham Bell, and Hermann von Helmholtz [36, 104]. The first speech synthesis devices were introduced by Homer Dudley and Walter Lawrence [34, 68]. Gunnar Fant introduced the first cascade formant synthesizer, followed by Allen, Umeda, Holmes, Rosen, and Klatt with MITalk [7, 35, 54, 65, 99, 118]. Vowel and consonant formulations have also been extensively studied by many researchers, including Peterson, Barney, Wells, Lieberman, Ladefoged, Johnson, Rabiner, Hillenbrand, Assman, Klautau, Coleman, Kewley, Cox, Bernard, Hagiwara, Harding, Picone, Stevens, Huckvale, Kidd, and Jurafsky [11, 14, 22, 24, 43, 45, 51,52,53, 55, 58,59,60,61, 63, 66, 67, 72, 90, 92, 94, 95, 112, 121]. Linguistics and phonetics have also contributed to understanding the resonant frequencies of speech signals via vocal tract articulatory movements for speech synthesis and analysis [6, 32, 41, 47, 84, 126]. The vocal tract resonant frequencies of speech signals are called formants and are formed during the passage of air through the vocal path [5, 29, 56, 79, 89, 113]. VOT (Voice Onset Time) is a significant factor in the identification of stop consonants [19, 20]. Nasal consonants have their own special structures [46]. The formation of the /r/ sound is special, and its formants are strongly influenced by the accompanying vowels [123]. The application areas of speech processing are highly diverse, including gender, age group, and speaker recognition [78, 102]. Speech phones, particularly vowels, have also been investigated for individual languages, and there are several studies on Turkish vowels [10, 15, 17, 64, 70].

The acoustic simulation of the vocal tract can be approximated by lossless two-tube or three-tube models [8, 9, 18, 26, 30, 35, 74, 77, 96, 97, 101]. The human ear is more discriminative but less sensitive at lower frequencies, whereas at high frequencies it is less discriminative but more sensitive: a 3000 Hz sound is perceived better than a 100 Hz sound of the same amplitude, yet in the low-frequency region it is easier to discern different frequencies. This phenomenon constitutes the foundation of the equal-loudness curves [37, 98, 114]. The human frequency scale is roughly linear below 1200 Hz and logarithmic beyond that region. Our ear is a logarithmic frequency analyzer, and its working range is remarkable: under ideal test conditions, it resolves frequency differences as small as 3.6 Hz in the 1000–2000 Hz band [80, 86, 106, 122, 125].

Given this background, we should explain the need for proposing novel filters. The novel AFB filters are designed to compete with Mel filters, PLP filters, and MFCC features, which are the most widely used representations in speech processing applications. Mel filter banks contain 40 bands, and the MFCC reduces these 40 bands to 13 coefficients via the discrete cosine transform. The problem with Mel filters is that they use too many filters, which invites overfitting and incurs computational and temporal overhead despite the high accuracy rates. The MFCC has fewer coefficients; however, its performance, particularly in deep learning applications, is unsatisfactory. In our study, PLP is implemented with 21 subbands. The proposed AFB filters provide a middle ground: they use fewer coefficients than Mel filters while providing accuracies comparable to those of the Mel and PLP filters. Currently, the AFB features comprise only 11 trapezoidal frequency bands. In Sect. 3, we comprehensively delineate the proposed AFB filters.

The remainder of this manuscript is organized as follows: Sect. 2 reviews related studies, Sect. 3 provides a detailed explanation of the proposed AFB filter banks, Sect. 4 discusses the speech datasets used in our experiments, Sect. 5 addresses the convolutional neural network used in the experiments, Sect. 6 presents the experimental results, and Sect. 7 finalizes the paper with the conclusions and future directions.

2 Related Works

In this work, we run experiments on the SCD (Speech Command Dataset) [120] and TIMIT [39] datasets. There are numerous studies on these widely used datasets, including that of Andrade et al. [28], who studied a convolutional recurrent neural network with attention on SCD v1 and v2. They achieved 93.9% accuracy on v2 for the 35-command recognition task with an attention-RNN model. They extracted 80-band Mel-scale features with 1024-point fast Fourier transform (FFT) frames and 128-point overlapping windows. Their model applies a set of convolutions to the feature vector, followed by two bidirectional LSTM layers, and passes the LSTM output through three dense layers.

In [115], Toth proposed maxout neurons in convolutional neural networks as an alternative to the rectified linear unit activation. He conducted experiments on the TIMIT dataset and achieved outstanding phone error rates. The experiments used 40-dimensional Mel filter bank features plus the frame energy; delta and double-delta coefficients were also computed, yielding 123 features in total. The paper outperformed previous works, reporting a 16.5% phone error rate with a hierarchical CNN model. The author also tested the Hungarian Broadcast News Corpus as a large-vocabulary continuous speech recognition task. The Szeged dataset contains 28 h of speech data from Hungarian TV channels; 22 h were used for training, 2 h for validation, and 4 h for testing. In this second experiment, the proposed CNN model achieved the best performance, with a 15.5% phone error rate.

In [33], Dridi and Ouni proposed CGDNN (convolutional gated deep neural network) and conducted phoneme recognition experiments on the TIMIT dataset. The result is a 15.72% phone error rate using 40-dimensional Mel filter bank features with their delta and double delta derivatives.

In [71], Li and Zhou tested a single-layer softmax model, a three-layer fully connected DNN, and a convolutional neural network on SCD v1 for KWS (keyword spotting). The speech wave files were processed with 30-ms frames and a 10-ms stride to extract a 40-dimensional feature set, and only 6 words from SCD v1 were used. The CNN model demonstrated markedly higher performance than the softmax DNN and vanilla RNN models, with accuracies of 94.5%, 71.9%, and 56.7%, respectively.

Berg et al. introduced the keyword transformer [13] to SCD v1 and v2. They used 40 Mel filters with an 80:10:10 train:validation:test set split along with data augmentation and preprocessing. They achieved 97.27% by the multihead attention-RNN model, 97.53% by KWT-2, and 97.74% by the KWT-2 distillation model on the SCD v2 with all 35 keywords.

Trinh et al. [116] experimented on SCD v2 and proposed a novel augmentation method called ImportantAug, which adds noise to the unimportant parts of speech data. They used additional noise with importance maps. They achieved a 6.7% error rate without augmentation, 6.52% with conventional noise augmentation, and 5.00% with the proposed ImportantAug method.

3 Proposed AFB Filter Banks

The structure of human hearing has been studied extensively, and many models have been proposed to imitate the auditory system. The Mel filter bank and MFCC have been used for over half a century in speech processing, and the time is ripe to replace them with better features for representing speech signals. In this study, a novel filter bank strategy named acoustical filter banks (AFB) is proposed for speech processing applications to replace Mel filters, PLP filters, and MFCCs. The foundations of the novel AFB filters rely heavily on the formant regions of vowels and consonants. The novel features contain only 11 marginally overlapping trapezoidal frequency subbands, as delineated in Table 1 and graphed in Fig. 1. They are less expensive to compute, provide a more compact representation of the data from which to extract information about the underlying dynamics, and offer enhanced interpretability compared with Mel filters or the MFCC. In Fig. 2, we sketch the graphs of the AFB, MFCC, and Mel filters for the vowels /i/, /u/, and /a/, respectively. In speech processing, we divide the speech signal into windowed frames of a certain length, such as 25-ms frames with 10-ms overlaps; processing a single phone therefore requires several frames. In Fig. 2, each line in the graph of a vowel represents one of these windowed frames of the vowel signal. Although frequency information, namely the first and second formants (\(\mathcal{F}1\), \(\mathcal{F}2\)), can strongly represent vowel sounds, consonants cannot be segregated solely by employing frequency regions. The manner in which a phone is formed is decisive for the final consonant; it is possible to create very different consonants with the exact same spectral structure. In Fig. 3, the time-domain signals and spectrograms of the words "hissing" and "hitting" are shown. Here, we apply a trick to the time-domain signal of "hissing" and silence the yellow-marked portion without interfering with the frequency domain at all. As shown in Fig. 3, the yellow-marked region is part of the consonant /s/ in the word "hissing". Silencing it creates a short silence immediately before the second half of the phone /s/, commonly called the VOT [19, 20], and the whole signal is heard as "hitting" instead of "hissing". This is one of the difficulties that make speech recognition so formidable, along with the varying length of each phone from person to person, or even within the same speaker, and the varying phone boundary regions. Delta acceleration coefficients, Haar wavelets [42], or other change-point detection methods can be helpful for identifying such sharp changes across frequency bands.
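As a minimal sketch of how trapezoidal subband energies can be computed from 25-ms frames with a 10-ms hop, in the spirit of the AFB features, consider the code below; the band edges shown are placeholders, since the actual 11 AFB bands are specified in Table 1.

```python
import numpy as np

# Hypothetical trapezoidal band edges in Hz: (stop_lo, pass_lo, pass_hi, stop_hi).
# These are placeholders for illustration; the actual 11 AFB bands are given in Table 1.
EXAMPLE_BANDS = [(200, 300, 500, 600), (500, 600, 900, 1000), (900, 1000, 1400, 1500)]

def trapezoid_filter(freqs, stop_lo, pass_lo, pass_hi, stop_hi):
    """Trapezoidal frequency response: 0 outside, 1 in the passband, linear ramps in between."""
    up = np.clip((freqs - stop_lo) / (pass_lo - stop_lo), 0.0, 1.0)
    down = np.clip((stop_hi - freqs) / (stop_hi - pass_hi), 0.0, 1.0)
    return np.minimum(up, down)

def afb_energies(signal, sr=16000, frame_len=400, hop=160, bands=EXAMPLE_BANDS):
    """Frame the signal (25 ms / 10 ms at 16 kHz), window, FFT, and sum power under each trapezoid."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    filters = np.stack([trapezoid_filter(freqs, *b) for b in bands])
    feats = np.zeros((n_frames, len(bands)))
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        feats[i] = filters @ power
    return np.log(feats + 1e-10)
```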

Table 1 Frequency bands of the novel AFB filters
Fig. 1
figure 1

Picture of the proposed Acoustic Filter Banks

Fig. 2
figure 2

Visualization of the power spectra of the AFB, MFCC, and Mel filters for the vowels /i/, /u/, and /a/. Each line represents a windowed frame of the vowel

Fig. 3
figure 3

The wave signal and frequency spectrogram of the words a “hissing” and b “hitting” after silencing the yellow marked region of the word “hissing”. Here, we do not modify the frequency domain. We silence only the region marked in yellow (Color figure online)

The construction of the AFB filters depends heavily on the formant regions of vowels and consonants. For this purpose, we used the formant regions of vowels reported in current studies. Although vowel theory is well established, consonants need more attention and are difficult to observe because their diverse articulatory formations make them extremely complex. In contrast to the studies on vowels, there are considerable differences among researchers regarding consonant bands, and consonant band studies are scarce compared to vowel studies [19, 20, 25, 38, 43, 46, 65, 66, 78, 89, 102, 121, 123, 124]. We implemented millions of binary classification experiments over all possible frequency subband pairs to determine the most discriminative frequency bands between acoustical neighbors and similar phones. These experiments helped us explore the differing frequency regions between vowels and consonants, leading to the fine-tuned subbands of the AFB filter banks.

At the outset of our study, we set the upper frequency boundary of the AFB's 11th filter to 5000 Hz. However, upon closer investigation, we found that the phone /s/ (and arguably the phones /z/, /ʤ/, /tʃ/, /ʃ/, /Ʒ/, /k/, /g/, /t/, and /d/) has wider spectral bandwidths spanning from 3000 Hz up to 7000 Hz, particularly when accompanied by the vowel /iy/ or /ih/. We experimented with 5500 Hz, 6000 Hz, and 7000 Hz boundaries; the results are nearly identical for 6 kHz and 7 kHz, while 5500 Hz lags slightly behind. Therefore, to keep the spectrum as narrow as possible, we selected 6000 Hz as the upper boundary of the AFB filters. We did not add a new frequency band here because we did not observe any additional distinct frequency region separating other phones. Another interesting finding is that the arrangement of the input features strongly affects performance. Speech signals are inherently 1D, and when they are fed into a 2D-CNN, they must be converted to a 2D matrix. Performance is best when they are arranged in the matrix form \((frame\_count\times feature\_count)\) instead of an arbitrary choice of rows and columns.
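A short sketch of the feature arrangement described above, reshaping the flattened per-utterance features into a (frame_count × feature_count) matrix with a channel axis for a 2D-CNN; the 24 × 22 example dimensions follow the TIMIT configuration reported in Sect. 6.

```python
import numpy as np

def to_cnn_input(flat_features, frame_count, feature_count):
    """Arrange a flattened per-utterance feature vector as a (frame_count, feature_count, 1)
    matrix so that the 2D-CNN sees time along one axis and subbands along the other,
    rather than an arbitrary row/column split."""
    x = np.asarray(flat_features, dtype=np.float32)
    assert x.size == frame_count * feature_count, "unexpected feature vector length"
    return x.reshape(frame_count, feature_count, 1)  # add channel axis for Conv2D

# Example (TIMIT-style): 24 frames x 22 AFB features (11 bands + 11 deltas).
dummy = np.random.randn(24 * 22)
print(to_cnn_input(dummy, 24, 22).shape)  # (24, 22, 1)
```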

From Fig. 2, we can observe that the AFB filters reveal the nature and structure of these sounds better than the MFCC and Mel filters do. Mel filters can be interpreted as better than MFCCs; however, they fall short of the AFB filters, and it is quite difficult to find any evidence about the structure of phones from the MFCC. The AFB filters can be regarded as a compact view of Mel filters, emphasizing distinct passband regions for phonetic discrimination. Phones that are acoustic neighbors should have a disparate (passband) region where the phone is perceived exactly as it is; there are also overlapping crossover (transition band) regions between acoustic neighbors where the phone can be perceived as either of them. Humans often confuse the pronunciation of acoustic neighbors or similar phones such as /u/-/o/, /o/-/a/, /u/-/ɯ/, /e/-/i/, /s/-/z/, /ʤ/-/tʃ/, /ʃ/-/Ʒ/, /k/-/g/, /p/-/b/, and /t/-/d/. This observation is supported by numerous studies in which the highest error rates in confusion matrices occur for similar phones [51, 90]. It is therefore difficult to achieve perfect discrimination between phones; instead, such burdens should be handled by language models, which can determine the closest matching words or sentences.

AFB filters can be represented using a nonuniform filter bank summation of the short-time Fourier transform as follows:

$$ \omega_{k} = \frac{2\pi k}{N}, \quad k = 0,\;1,\;2, \ldots ,N - 1 $$
$$ w_{k} \left( n \right) = w\left( n \right) $$
$$ H_{k} \left( {e^{j\omega } } \right) = W_{k} \left( {e^{{j\left( {\omega - \omega_{k} } \right)}} } \right) $$
$$ h_{k} \left( t \right) = a^{ - k/2}\, h\left( {b^{ - k} t} \right) $$
$$ H_{k} \left( {e^{j\omega } } \right) = a^{k/2} H\left( {e^{{j\omega b^{k} }} } \right) $$

The nonuniform subband bandwidths of the AFB and the nonuniform decimation are typical components of wavelet filters, in which all frequency responses are obtained via frequency scaling instead of the frequency shifting of the short-time Fourier transform. Note that nonuniform subbands are highly compatible with the human hearing system. As an alternative to nonuniform filter bank summation, we can use Fejér–Korovkin [82] wavelet filters to construct marginally overlapping trapezoidal frequency-domain filters. The Fejér–Korovkin kernel \(K_{m}\) is defined by:

$${K}_{m}\left(\xi \right)=\begin{cases}\dfrac{2{\sin}^{2}\left(\pi /(m+2)\right)}{m+2}{\left[\dfrac{\cos \left(\left(m+2\right)\xi /2\right)}{\cos \left(\pi /(m+2)\right)-\cos \xi }\right]}^{2}, & \xi \notin \pm \dfrac{\pi }{m+2}+2\pi {\mathbb{Z}} \\ \dfrac{m+2}{2}, & \xi \in \pm \dfrac{\pi }{m+2}+2\pi {\mathbb{Z}}\end{cases}$$

\({K}_{m}\) can be written in the form of

$${K}_{m}\left(\xi \right)=1+2\sum_{k=1}^{m}{\theta }_{m}\left(k\right)\cos k\xi$$

where

$${\theta }_{m}\left(k\right)=\frac{\left[(m-k+3)\mathit{sin}\frac{k+1}{m+2}\pi -(m-k+1)\mathit{sin}\frac{k-1}{m+2}\pi \right]}{2(m+2)\mathit{sin}\left(\frac{\pi }{(m+2)}\right)}$$

The Fejér–Korovkin filter is expressed as follows:

$$ \left| h_{0}^{m} \left( \xi \right) \right|^{2} = \frac{1}{2\pi }\int_{ - \pi /2}^{ + \pi /2} K_{m} \left( \xi - u \right)\, du $$

Nonuniform m-channel quadrature mirror filters or cosine-modulated filters ensure perfect reconstruction of signals when constrained to a paraunitary polyphase matrix, with significant simplification even in multirate systems. Cosine-modulated analysis and synthesis filters can also represent AFB filters [119]. A comprehensive discussion of these filters is beyond the scope of this manuscript. In Fig. 4, Fejér–Korovkin filters and their normalized frequency magnitude responses are depicted. As seen in panels (c) and (d) of Fig. 4, they produce a near-trapezoidal frequency-domain bandpass filter bank. AFB filters can be used as analysis filters as well as synthesis filters for the perfect reconstruction of the signal.

Fig. 4
figure 4

Fejér–Korovkin filters and their frequency magnitude responses
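For readers who wish to inspect Fejér–Korovkin filters directly, the following sketch retrieves the length-8 Fejér–Korovkin filter pair from PyWavelets and evaluates its magnitude responses; the choice of 'fk8' and the 512-point frequency grid are illustrative assumptions.

```python
import numpy as np
import pywt                      # PyWavelets provides Fejér–Korovkin filters ('fk4', 'fk8', ...)
from scipy.signal import freqz

# Decomposition low-pass / high-pass filters of the length-8 Fejér–Korovkin wavelet.
wav = pywt.Wavelet('fk8')
lo, hi = np.asarray(wav.dec_lo), np.asarray(wav.dec_hi)

# Normalized magnitude responses of the two analysis filters; cascading them in a
# wavelet (or wavelet packet) tree yields near-trapezoidal bandpass responses.
w, h_lo = freqz(lo, worN=512)
_, h_hi = freqz(hi, worN=512)
mag_lo = np.abs(h_lo) / np.max(np.abs(h_lo))
mag_hi = np.abs(h_hi) / np.max(np.abs(h_hi))
print("low-pass magnitude at w=pi/2:", round(float(mag_lo[256]), 3))
print("high-pass magnitude at w=pi/2:", round(float(mag_hi[256]), 3))
```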

Vowels and consonants can also be studied and simulated using acoustic tube models. An approximate 3D drawing of the vocal tract simulator with 4 concatenated tubes and the nasal cavity is shown in Fig. 5. In a closed acoustic tube, sound waves are governed by the following equations:

$$ - \frac{\partial p}{\partial x} = \frac{\rho }{S}\frac{\partial u}{\partial t} $$
$$ - \frac{\partial u}{\partial x} = \frac{S}{\rho c^{2} }\frac{\partial p}{\partial t} $$
Fig. 5
figure 5

Simulation of the vocal tract using concatenated acoustic tubes

In these equations, the cross-sectional area \(\left({S}_{k}, {S}_{k+1}\right)\) of the \({k}^{th}\) tube is assumed to be fixed, such that \(S\left(x, t\right)=S\); \(p\) is the pressure, \(u\) is the volume velocity of the wave at position x and time t, and ρ and c are the density of air and the velocity of the sound wave in the acoustic tube, respectively. Nonuniform lossless tubes have no closed-form solutions, and solving the wave equations remains difficult even under the above assumptions. The solution of the above equations for the \({k}^{th}\) tube can be written as follows:

$${p}_{k}\left(x,t\right)=\frac{\rho c}{{S}_{k}}\left[{u}_{k}^{+}\left(t-\frac{x}{c}\right)+{u}_{k}^{-}\left(t+\frac{x}{c}\right)\right], 0\le x\le {{\ell}}_{k}$$
$${u}_{k}\left(x,t\right)={u}_{k}^{+}\left(t-\frac{x}{c}\right)-{u}_{k}^{-}\left(t+\frac{x}{c}\right), 0\le x\le {{\ell}}_{k}$$

where \({u}_{k}^{+}\left(t-x/c\right)\) is the forward-traveling wave, \({u}_{k}^{-}\left(t+x/c\right)\) is the backward-traveling wave, and \({{\ell}}_{k}\) denotes the length of the \({k}^{th}\) acoustic tube. Using the flow and pressure continuity at the junction of the tubes, we can derive the following equations in matrix form:

$$\text{flow continuity:}\quad {U}_{k}-{V}_{k}={W}_{k+1}-{X}_{k+1}$$
$$\text{pressure continuity:}\quad \frac{\rho c}{{S}_{k}}\left({U}_{k}+{V}_{k}\right)=\frac{\rho c}{{S}_{k+1}}\left({W}_{k+1}+{X}_{k+1}\right)$$
$$\left[\begin{array}{cc}1& -1\\ {S}_{k+1}& {S}_{k+1}\end{array}\right]\left[\begin{array}{c}{U}_{k}\\ {V}_{k}\end{array}\right]=\left[\begin{array}{cc}1& -1\\ {S}_{k}& {S}_{k}\end{array}\right]\left[\begin{array}{c}{W}_{k+1}\\ {X}_{k+1}\end{array}\right]$$
$$\left[\begin{array}{c}{U}_{k}\\ {V}_{k}\end{array}\right]=\frac{1}{{2S}_{k+1}}\left[\begin{array}{cc}{S}_{k+1}& 1\\ -{S}_{k+1}& 1\end{array}\right]\left[\begin{array}{cc}1& -1\\ {S}_{k}& {S}_{k}\end{array}\right]\left[\begin{array}{c}{W}_{k+1}\\ {X}_{k+1}\end{array}\right]$$

By defining \({r}_{k}\) as the amount of \({u}_{k+1}^{-}\left(t\right)\) that is reflected at the junction point,

$${r}_{k}=\frac{{S}_{k+1}-{S}_{k}}{{S}_{k+1}+{S}_{k}}, (-1\le {r}_{k}\le +1)$$
$$\left[\begin{array}{c}{U}_{k}\\ {V}_{k}\end{array}\right]=\frac{1}{1+{r}_{k}}\left[\begin{array}{cc}1& -{r}_{k}\\ -{r}_{k}& 1\end{array}\right]\left[\begin{array}{c}{W}_{k+1}\\ {X}_{k+1}\end{array}\right]$$

Using the z-transform, this time delay can be converted into multiplication by \({z}^{-1/2}\):

$${U}_{k}(z)={z}^{-1/2}{X}_{k+1}(z)$$
$${V}_{k}(z)={z}^{+1/2}{W}_{k+1}(z)$$

For an n-segment concatenated tube, assuming \({V}_{{\ell}}=0\) (no reflection at the mouth opening) and ignoring \({V}_{g}\) because of absorption by the lungs, we obtain:

$$\left[\begin{array}{c}{U}_{g}\\ {V}_{g}\end{array}\right]={z}^{1/2}\left[\begin{array}{cc}1& 0\\ 0& {z}^{-1}\end{array}\right]\left[\begin{array}{c}{W}_{k+1}\\ {X}_{k+1}\end{array}\right]= \frac{{z}^{n/2}}{ \prod_{k=0}^{n}(1+{r}_{k})}\prod_{k=0}^{n-1}\left[\begin{array}{cc}1& {-r}_{k}{z}^{-1}\\ {-r}_{k}& {z}^{-1}\end{array}\right]\times \left[\begin{array}{c}1\\ {-r}_{n}\end{array}\right]{U}_{{\ell}}$$

Hence, the transfer function is defined as:

$$V(z)=\frac{{U}_{{\ell}}}{{U}_{g}}=\frac{G{z}^{-n/2}}{1-{\alpha }_{1}{z}^{-1}-{\alpha }_{2}{z}^{-2}-\cdots -{\alpha }_{n}{z}^{-n}}$$

where G is the gain, \({z}^{-n/2}\) is the delay across the vocal tract, and the denominator of the transfer function is an all-pole filter of order n. With the help of such closed acoustic tube models, we can explore the formants of vowels and consonants.
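As an illustration of how the lossless-tube model above yields resonances, the sketch below converts hypothetical tube cross-sectional areas into reflection coefficients \(r_k\), builds the all-pole denominator with a step-up recursion, and reads resonance frequencies from the pole angles; the area values, sampling rate, and sign conventions are assumptions for illustration only.

```python
import numpy as np

def reflection_to_allpole(r):
    """Step-up (Levinson) recursion: turn reflection coefficients into the all-pole
    denominator A(z) = 1 + a1 z^-1 + ... + an z^-n. Sign conventions differ between
    texts; this is one illustrative choice."""
    a = np.array([1.0])
    for k in r:
        a_prev = np.concatenate([a, [0.0]])
        a = a_prev + k * a_prev[::-1]
    return a

def tube_formants(areas, sr=16000):
    """Hypothetical cross-sectional areas (cm^2) of concatenated tube sections, glottis to lips.
    Resonance (formant) candidates come from the complex pole angles of 1/A(z); each section
    is implicitly assumed to contribute one sample of delay at the given sampling rate."""
    areas = np.asarray(areas, dtype=float)
    r = (areas[1:] - areas[:-1]) / (areas[1:] + areas[:-1])   # r_k as in the equation above
    a = reflection_to_allpole(r)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                          # keep one of each conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    return freqs[freqs > 90]                                   # drop near-DC artifacts

# Example with a made-up area function (placeholder values, not measured data).
print(np.round(tube_formants([2.6, 8.0, 10.5, 9.0, 4.0, 1.5])))
```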

There are many studies on the formants of vowels and consonants. Some studies, such as the Hillenbrand and North Texas vowel datasets, provide \({f}_{0}\), \(\mathcal{F}1\), \(\mathcal{F}2\), \(\mathcal{F}3\), and even \(\mathcal{F}4\) values. Most of them agree on the frequency regions of vowels, except for some rare cases such as /uu/ and /al/, as depicted in Figs. 6 and 7 for the Hillenbrand and North Texas vowel datasets. The first formant together with the second formant can adequately represent the vowels. We would even suggest that for some vowels, such as /iy/, two formants may not be needed: for the phone /iy/, \(\mathcal{F}2\) alone can sufficiently represent the vowel. Philips et al. also studied single formants [91].

Fig. 6
figure 6

Comparison of fundamental and formant frequency values of the Hillenbrand and North Texas datasets for the vowels /oo/, /ow/, /oa/, /er/, /ah/, /aw/, and /uh/. The boy, girl, woman, and kid classes are sorted by the \({f}_{0}\) value in ascending order, and the man class is sorted by the \({f}_{0}\) value in descending order for better visual discrimination

Fig. 7
figure 7

Comparison of fundamental and formant frequency values of the Hillenbrand and North Texas datasets for the vowels /al/, /el/, /ee/, /il/, and /ii/. The boy, girl, woman, and kid classes are sorted by the \({f}_{0}\) value in ascending order, and the man class is sorted by the \({f}_{0}\) value in descending order for better visual discrimination

It is well known that when a speech signal is high-pass filtered above 500 Hz, intelligibility remains nearly intact except for a loss of loudness due to the removal of the fundamental frequency components, which almost always lie below 500 Hz [31]. Interestingly, 500 Hz is the first resonant frequency of a neutral vocal tract shape that produces the vowel /e/. We suggest that formants below 500 Hz can be discarded, as they are not required for the perception and intelligibility of the phones; this region below 500 Hz is not significant for speech recognition. For instance, when a high-pass filter with a cutoff frequency of 1000 Hz is applied to the sentence "She sees seas", the sentence is still intelligible apart from some loss of loudness. All the phones in this sentence have formant bands above 1000 Hz except for the first formant of /iy/. Even when the high-pass filter is applied twice to remove possible artifact remnants, the sentence does not lose its intelligibility, which cannot be explained by a missing-\(\mathcal{F}1\) effect analogous to the long-disputed missing-\({f}_{0}\) concept. However, we should keep in mind that removing a frequency component below 500 Hz does not necessarily remove its perception by the human brain, as in the case of the long-standing missing fundamental dilemma [85, 103, 105, 117]. Another interesting phenomenon related to speech is the difference between genders: the voices of men, women, boys, and girls have different characteristics [23, 83]. Speech processing techniques also use a variety of features, including wavelet and wavelet packet transforms [75, 108].
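The informal high-pass intelligibility check described above can be reproduced with a few lines; the filter order and the synthetic test signal below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def highpass_speech(signal, sr=16000, cutoff_hz=1000.0, order=8):
    """Zero-phase high-pass filter removing energy below cutoff_hz (illustrative settings)."""
    sos = butter(order, cutoff_hz, btype='highpass', fs=sr, output='sos')
    return sosfiltfilt(sos, signal)

# Synthetic check: a 200 Hz "fundamental-like" tone plus a 2500 Hz "formant-like" tone.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 2500 * t)
# Filtering twice, as in the text, to suppress possible low-frequency remnants.
y = highpass_speech(highpass_speech(x))
print("signal energy before/after:",
      round(float(np.mean(x ** 2)), 3), round(float(np.mean(y ** 2)), 3))
```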

The Hillenbrand and North Texas datasets provide valuable information about vowel formants; hence, we present them in Figs. 6 and 7. Each vowel class is depicted in the order boy–girl–man–woman for the Hillenbrand dataset and kid–man–woman for the North Texas dataset. In the figures, the boy, girl, woman, and kid classes are sorted by the \({f}_{0}\) value in ascending order, whereas the man class is sorted by the \({f}_{0}\) value in descending order for better visual discrimination. The children in the North Texas dataset are aged 3, 5, and 7 years, whereas the Hillenbrand dataset provides no age information. The Hillenbrand dataset contains 1668 vowel samples (540 men, 576 women, 324 boys, 228 girls), and the North Texas dataset contains 3314 vowel samples (972 kids, 1232 men, 1110 women). In Figs. 6 and 7, the vertical axis denotes the frequency edges of the AFB filters, and the horizontal axis denotes the vowel ARPABET class with the number of samples.

In both the North Texas and Hillenbrand datasets, \(\mathcal{F}1\) and \(\mathcal{F}2\) of the /aa/ and /aw/ phones lie in nearly the same regions. In the North Texas dataset, \(\mathcal{F}2\) is higher than in the Hillenbrand dataset for the /uh/, /ul/, and /uu/ phones. In North Texas, \(\mathcal{F}2\) of /ul/ and /uu/ is greater for children and women, which may be due to mispronunciation and mislabeling (particularly for kids), accents, formant calculation errors, or outliers. \(\mathcal{F}1\) lies in the same place for all of /oo/, /uw/, /oa/, /er/, /ah/, /aw/, and /uh/. The phone /oo/ is sometimes pronounced like /u:/, as in "hue", with the tongue slightly forward, which raises \(\mathcal{F}2\); in the North Texas dataset, the \(\mathcal{F}2\) of /oo/ is greater than the \(\mathcal{F}2\) of /oa/ in the Hillenbrand vowel dataset. Examining the /oa/ and /er/ phones in the Hillenbrand vowel dataset and the /oo/ and /er/ phones in the North Texas vowel dataset, we observe that the \(\mathcal{F}2\) formant of /er/ is shifted one level upward compared with the /oa/ of Hillenbrand or the /oo/ of North Texas, while \(\mathcal{F}1\) remains on nearly the same band. This is a valuable clue for exploring articulatory movements. There are two main vocal tract articulations, namely tongue and lip movements; jaw movement corresponds to moving the tongue up or down. When pronouncing /er/, we move the tongue slightly further forward than for /oa/ and /oo/; therefore, we can conclude that fronting the tongue raises the \(\mathcal{F}2\) formant. Lip protrusion is the other articulatory movement and lowers the \(\mathcal{F}1\) formant. In Fig. 7, \(\mathcal{F}1\) lies in the same place for all of /al/, /el/, /ee/, /il/, and /ii/, except the /al/ of kids and women, and \(\mathcal{F}2\) is approximately in the same place for all of them. The fundamental frequency \({f}_{0}\) is the same for all vowels, with nearly perfect agreement between the two datasets. The formant structures of /ih/, /iy/, /ii/, /ae/, /eh/, /ei/, /al/, /el/, and /ee/ are very clearly identified in both the Hillenbrand and North Texas datasets, and almost all studies on speech analysis agree on this issue with subtle differences [5, 7, 10, 11, 14, 15, 17, 24, 35, 51, 52, 56, 60, 61, 64, 65, 70, 72, 79, 89, 90, 113, 121]. From Figs. 6 and 7, we can observe how elegantly the formants align with the frequency regions of the AFB filter banks. The diphthongs /oy/, /ay/, and /ey/ should be considered the concatenation of /oa/, /aa/, and /eh/ with /iy/, respectively; therefore, the frequency spectra of /oy/, /ay/, and /ey/ are strongly affected by the frequency spectrum of /iy/ because of the time-blindness of the FFT. In TIMIT, /ao/ is sometimes pronounced like /ow/ and sometimes like /aa/. The formant scatter plots alone do not provide enough information for vowel discrimination; hence, we also examined other plot types and present the histograms and boxplots of the AFB filters of vowels in the Hillenbrand, North Texas, and TIMIT datasets. The boxplot representation produces better visualization for discriminating vowels with respect to the median points.

As we stated, consonants cannot be separated solely by means of formants. Some consonants have the same formant structure, but their articulatory formations are different. We can convert a consonant to another consonant by applying appropriate low-pass, high-pass, bandpass, or bandstop filters. The phone /Ʒ/ can be converted to /s/ by applying a high-pass filter above a 3000 Hz cutoff frequency when accompanied by /u/; on the other hand, when flanked by phone /i/, the cutoff frequency may increase up to 4000 Hz. Removing the frequency bands between 500 and 3000 Hz using a bandstop filter will convert /Ʒ/ into /z/. When accompanied by /i/, this bandstop region may extend between 500 and 4000 Hz. It is also possible to convert /Ʒ/ into /ʃ/ by applying a bandstop filter between 0 and 2000 Hz. However, if /Ʒ/ is coupled with /i/, this bandstop region will extend between 0 and 3000 Hz. There are some other conversions by means of adding or removing VOT before some phones. As illustrated in Fig. 3, we can silence the first half of the /s/ and convert it to /t/. Conversely, deleting this VOT before /t/ will convert /t/ into /s/. The same changes apply to the /ʃ/-/tʃ/ and /Ʒ/-/ʤ/ pairs. The phone /ʃ/ can be converted to /s/ by removing the frequency region between 1500 and 3000 Hz. We can apply similar transformations for the /k/-/g/, /p/-/b/, /t/-/d/ and /m/, /n/, /l/, /r/, /f/, /v/ consonants. The vocal tract is also accompanied by the nasal cavity as a parallel sound wave transmission line. The nasal cavity affects the speech signal by adding a zero or anti-formant over the 1000 Hz frequency region. Therefore, nasal phones (/m/, /n/) have very little high-frequency energy. The large surface of the nasal cavity causes greater thermal loss and viscous friction, leading to larger bandwidths for nasal resonances.

In our study, we relied on the formant regions of vowels and consonants to construct the AFB filter banks; however, a detailed, comprehensive discussion of phones, especially consonants, is beyond the scope of this paper. Interested readers may find further information in the references cited in this manuscript.

In Figs. 8, 9, and 10, we present the histograms of the Hillenbrand vowels, the North Texas vowels, the 20 TIMIT vowels, and the 24 TIMIT consonants. We chose to represent all vowels in TIMIT rather than the mapped ones in order to clarify the differences between them, if any. As seen in the histograms of Fig. 9 for the TIMIT dataset, there are significant differences between the mapped /ah/-/ax/-/axh/, /uw/-/ux/, /ao/-/aa/, and /er/-/axr/ classes, whereas the difference between /ih/ and /ix/ is not noticeable.

Fig. 8
figure 8

Histograms of spectral magnitudes of AFB filters of vowels in the a Hillenbrand and b North Texas datasets. The horizontal axis denotes the AFB filter number

Fig. 9
figure 9

Histograms of spectral magnitudes of AFB filters of vowels in the TIMIT dataset. The horizontal axis denotes the AFB filter number

Fig. 10
figure 10

Histograms of spectral magnitudes of AFB filters of consonants in the TIMIT dataset. The horizontal axis denotes the AFB filter number

Formant plots and magnitude histograms do not provide sufficient information for the discrimination of vowels; therefore, we decided to construct boxplots. Boxplot representation provides a better understanding when the median value is taken as the pivot point. Boxplots of the Hillenbrand and North Texas datasets are depicted in Fig. 11, boxplots of the TIMIT vowels are shown in Fig. 12, and TIMIT consonant boxplots are shown in Fig. 13.

Fig. 11
figure 11

Boxplots of the spectral magnitudes of the AFB filters in the a Hillenbrand and b North Texas vowel datasets. The vertical axis denotes the AFB filter number

Fig. 12
figure 12

Boxplots of the spectral magnitudes of the AFB filters of vowels in the TIMIT dataset. The vertical axis denotes the AFB filter number

Fig. 13
figure 13

Boxplots of spectral magnitudes of AFB filters of consonants in the TIMIT dataset. The vertical axis denotes the AFB filter number

Another advantage of boxplots is that they can be computed from filter bank magnitudes more easily and more reliably than formant calculations. Although we present the histograms and boxplots of consonants in the TIMIT dataset, we should emphasize that frequency regions alone do not help much with consonants unless the sharp temporal changes that occur during their formation are also detected.
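A minimal sketch of how such boxplots can be drawn directly from per-frame AFB magnitudes; the data in the example are random placeholders, and the function name and layout are our own illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def afb_boxplots(afb_frames_by_vowel, band_labels=None):
    """Boxplots of per-band AFB log-magnitudes for each vowel class.
    `afb_frames_by_vowel` maps a vowel label to an (n_frames, n_bands) array of AFB features;
    the median line of each box is the pivot point discussed in the text."""
    fig, axes = plt.subplots(1, len(afb_frames_by_vowel),
                             figsize=(4 * len(afb_frames_by_vowel), 3), sharey=True)
    for ax, (vowel, feats) in zip(np.atleast_1d(axes), afb_frames_by_vowel.items()):
        ax.boxplot(feats, vert=True, labels=band_labels)  # one box per AFB band
        ax.set_title(vowel)
        ax.set_xlabel("AFB filter number")
    plt.tight_layout()
    return fig

# Example with random placeholder data for two vowels and 11 AFB bands.
rng = np.random.default_rng(0)
fig = afb_boxplots({"/iy/": rng.normal(size=(50, 11)), "/aa/": rng.normal(size=(50, 11))})
```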

4 Datasets

In this study, we used the Speech Command Dataset version 2 and the TIMIT dataset. Speech Command version 2 contains 105,829 16-bit mono speech wave samples. Most of the samples are 1 s long; the maximum length is 1 s and the minimum length is 0.2133125 s, but only 441 files are shorter than 0.5 s. Therefore, we set the sample length to 1 s and used zero padding when necessary. The total duration is 28.83 h. The second dataset is the widely used TIMIT dataset, which comprises sentences, words, and phones; in this study, we used only the phones. There are 241,225 phones in the TIMIT dataset, including the SA samples. The maximum phone length is 4.6428125 s, and the minimum phone length is 0.002 s. Note that in the TIMIT dataset the silent parts are also treated as phones; otherwise, a single phone would not be expected to be that long. The full duration of the data is 5.37 h. SCD consists of 35 different words, tabulated in Table 2, and TIMIT comprises 39 phone classes, tabulated in Table 3.

Table 2 Number of samples in the Speech Command Dataset V2

In the TIMIT dataset, only 76 phones are longer than 1.015 s and 662 phones are longer than 0.511 s, all of which belong to class 39, which includes silences and closures. There are 2590 phones longer than 0.25 s, 17,809 phones shorter than 0.0111 s, and 1950 phones shorter than 0.0049 s. Therefore, we used a fixed length of 4096 sample points (~ 0.25 s) for the TIMIT phones and padded with zeros when needed. TIMIT contains 61 different phones; however, in most classification applications these 61 classes are mapped to 39 phones according to [69], as depicted in Table 3. In Tables 4 and 5, we list the phones of the datasets used in this paper with their corresponding IPA and ARPABET symbols [44]. ARPABET represents US English phones as distinct ASCII character pairs.
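The fixed-length padding described above can be sketched as follows; the 4096-sample target follows the text, while the helper name is our own.

```python
import numpy as np

def fix_length(phone_signal, target_len=4096):
    """Zero-pad or truncate a phone waveform to a fixed 4096-sample (~0.25 s at 16 kHz) window,
    matching the fixed sample-point length used for TIMIT phones in this study."""
    x = np.asarray(phone_signal, dtype=np.float32)
    if len(x) >= target_len:
        return x[:target_len]
    return np.pad(x, (0, target_len - len(x)))

# Example: a 0.1 s phone (1600 samples at 16 kHz) padded up to 4096 samples.
print(fix_length(np.ones(1600)).shape)  # (4096,)
```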

Table 3 Number of phones in the TIMIT dataset (mapped from 61 to 39)
Table 4 Vowels in the Hillenbrand, North Texas, and TIMIT datasets with ARPABET and IPA symbols
Table 5 All TIMIT phones with ARPABET and IPA symbols

5 Convolutional Neural Networks

CNNs (convolutional neural networks) are very powerful and successful models for image recognition and pattern classification applications, and many CNN architectures have been suggested for image recognition. They are also becoming increasingly popular in signal and speech processing applications and can perform even better than LSTM networks, which were designed natively for time-series data. The catch is that time-series data can be rearranged and fed into the classifier in the same way as 2D image data. In speech processing applications, the signal is confined to fixed frames within which it is assumed to be stationary, which transforms the problem into a standard image pattern matching problem. The advent of fast GPUs has enabled researchers to train and run CNN models faster than ever. In this study, we run our experiments with the well-known Visual Geometry Group (VGG16) model [109]. VGG16 comprises thirteen convolutional layers interleaved with max pooling layers; at the end, two fully connected layers feed a softmax classifier. In our implementation, all convolutional layers are equipped with a rectified linear unit (ReLU) activation function [3] and batch normalization. The VGG16 model achieved top results in the ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) 2014 classification and localization tasks. It is an exceptionally large network and employs over 15 million parameters in our experiments. In the original VGG16, the input is fed into the model as 224 × 224 images with 3 RGB channels. In this work, we arrange our 1D speech signal data as 2D data and feed them to the model, and we remove the last 2 max pooling layers for data structure compatibility. VGG16 is a great landmark in the quest to make computers understand what they see. The design of our VGG16 implementation is shown in Fig. 14.

Fig. 14
figure 14

VGG16 convolutional neural network model implementation
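Below is a hedged Keras sketch of the modified VGG16 described in this section, with ReLU and batch normalization after each convolution and the last two max pooling layers removed; the dense layer sizes and the exact pooling placement are our assumptions, not the authors' published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_vgg16_speech(input_shape=(100, 22, 1), n_classes=35):
    """VGG16-style network adapted to 2D speech feature maps: 13 convolutional layers
    with ReLU and batch normalization, the last two max pooling layers removed for
    compatibility with the small feature dimension, then two dense layers and a softmax
    classifier. Filter widths follow the original VGG16; dense sizes are assumptions."""
    cfg = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]   # (conv layers, channels) per block
    pooled_blocks = {0, 1, 2}                                  # only the first three blocks pool
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for b, (n_conv, ch) in enumerate(cfg):
        for _ in range(n_conv):
            x = layers.Conv2D(ch, 3, padding='same')(x)
            x = layers.BatchNormalization()(x)
            x = layers.Activation('relu')(x)
        if b in pooled_blocks:
            x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation='relu')(x)
    x = layers.Dense(512, activation='relu')(x)
    outputs = layers.Dense(n_classes, activation='softmax')(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# Example: SCD-style input of 100 frames x 22 AFB features (11 bands + deltas).
model = build_vgg16_speech()
```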

6 Results

In this section, we present the results of our experiments on the Speech Command V2 and TIMIT datasets with the VGG16 model. The experiments are conducted with the AFB filter, Mel filter, PLP filter, and MFCC feature sets. The environment is built on Python 3.8.10 [100], TensorFlow 2.11.0 [1], and Keras 2.11.0 [21]. Experiments are run using Adam optimization [62] with a learning rate of 0.001, \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), 100 epochs for the Speech Command dataset, and 30 epochs for the TIMIT dataset. The data are split into 70% training and 30% test sets. The feature extraction phase is implemented in MATLAB 2019a [76] with the Auditory Toolbox [110]; PLP is implemented using the Rastamat package of Mark Shire and Dan Ellis [93]. Speech signals are dissected into 25-ms frames with a 10-ms step (overlapping frames). We also incorporate the first-order delta acceleration coefficients into our feature sets. No data augmentation is performed on our datasets.

Feature extraction is run with 400-sample (25 ms) frames and 160-sample (10 ms) window shifts, creating a 24-frame feature sequence for TIMIT and a 100-frame feature sequence for the SCD dataset. This 1D feature vector is converted into a 2D matrix of dimension \(\left(24\times feature\_count\right)\) for TIMIT and \(\left(100\times feature\_count\right)\) for SCD before being fed into the VGG network. There are 22 features in the AFB set, 26 in the MFCC set, 42 in the PLP set, and 80 in the Mel set, including the first-order delta acceleration coefficients. All speech signals are processed with a Hamming window and pre-emphasis with \(\alpha =0.97\). In the computation of the Mel and MFCC filters, the lowest frequency is 100 Hz, the linear spacing is 66 Hz, the number of linear filters is 13, the number of log filters is 27, and the log spacing is 1.0711703.
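The pre-emphasis and first-order delta computations mentioned above can be sketched as follows; the delta regression width of 2 is a common default and an assumption here.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1], applied before framing."""
    x = np.asarray(signal, dtype=np.float32)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def delta(features, width=2):
    """First-order delta (acceleration) coefficients over the frame axis using the
    standard regression formula; `features` is (n_frames, n_bands)."""
    padded = np.pad(features, ((width, width), (0, 0)), mode='edge')
    denom = 2 * sum(d * d for d in range(1, width + 1))
    return sum(d * (padded[width + d:len(features) + width + d] -
                    padded[width - d:len(features) + width - d])
               for d in range(1, width + 1)) / denom

# Example: 24 frames of 11 AFB bands -> 22 features per frame after appending deltas.
feats = np.random.randn(24, 11)
full = np.hstack([feats, delta(feats)])
print(full.shape)  # (24, 22)
```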

The classification results are tabulated in Table 6 for the Speech Command dataset and in Table 7 for the TIMIT dataset. AFB outperforms the MFCC by a significant margin on both datasets and runs shoulder to shoulder with the Mel and PLP filters on the Speech Command V2 dataset. AFB also converges better than the MFCC and strongly competes against the Mel filters, as illustrated in the training accuracy graph of Fig. 15. AFB is also less susceptible to overfitting than the MFCC, Mel, and PLP filters because of its smaller number of bands. We should bear in mind that the AFB filters contain only 11 coefficients, compared with 40 Mel filters, 21 PLP filters, and 13 MFCCs.

Table 6 Classification results (% ACC) on the Speech Command Dataset V2 with the VGG16 network. The number of features is shown in parentheses
Table 7 Classification results (% ACC) on the TIMIT dataset with the VGG16 network. The number of features is shown in parentheses
Fig. 15
figure 15

Comparison of accuracy between AFB filters, Mel filters, and MFCC on the Speech Command Dataset V2 with VGG16 and the Adam optimizer. AFB converges better than MFCC and PLP and challenges Mel filters very strongly

We continue our quest to explore the best filter bank strategy and try various filter banks with different numbers of filters on the Speech Command V2 and TIMIT datasets with the VGG16 architecture. We vary the number of filters from 6 to 25 and present the results in Figs. 16 and 17. Selecting a filter count between 9 and 13 is a sensible choice for speech processing: the region below 9 filters (yellow) can be considered an underfitting area, and the region above 16 filters (red) is where overfitting concerns begin to emerge. Therefore, we finalized our design with 11 AFB filters to minimize the number of features.

Fig. 16
figure 16

Accuracies according to the number of filters on the Speech Command Dataset V2 with VGG16

Fig. 17
figure 17

Accuracies by the number of filters on the TIMIT dataset with VGG16. The region below 9 filters (yellow) points to underfitting, and the region above 16 filters (red) insinuates overfitting

7 Conclusions and Future Directions

In this study, we conducted experiments with the novel AFB filters and compared them with Mel filter, MFCC, and PLP features on the TIMIT dataset and the Speech Command Dataset version 2. The novel AFB filters outperformed the MFCC in all experiments and achieved accuracies comparable to those of the Mel filters on the Speech Command Dataset V2 when utilizing the well-known VGG16 model. The results suggest that Mel filter, MFCC, or PLP features can be replaced with the novel AFB filters in speech processing applications. Using 40 bands in Mel filters appears unnecessary. The novel filter banks are computationally far less expensive than Mel filters: AFB contains only 11 filters, compared with 40 Mel filters or 13 MFCCs, and can be extracted faster. In this study, we evaluated different filter banks with up to 25 subbands. Some of these filter banks (those with 17 and 19 subbands) perform nearly as well as the Mel filters and can be used where high accuracy is the main objective; however, as with the Mel filters, they may also be subject to overfitting concerns. We should also take into account the unbalanced nature of the TIMIT dataset, particularly with regard to class 39. Our extensive experiments indicate that the number of filters should be between 9 and 13; however, models with 17 or 19 subbands are other strong candidates. As discussed in Sect. 3, AFB filters allow filter bank summation methods and the use of wavelet filters, such as Fejér–Korovkin or quadrature mirror filters, to construct nonuniform, marginally overlapping trapezoidal filters, which enables reconstruction of the signal with fast filter bank implementation algorithms. Moreover, AFB filters have proven to be powerful representations for speech processing because of their strong and natural foundation and may usher in new methods for speech processing applications. In our experiments, TIMIT was used as a database of phones, whereas the Speech Command dataset was used as a command dataset; this difference may partly explain the results on the TIMIT dataset. Consonants cannot be segregated solely by frequency features, which is a well-established issue; if we can find a better representation for detecting voice onset time, phone boundaries, and consonant transitions, AFB filters may excel further. The performance of Mel filters, as well as of other filter banks with a large number of subbands, on the TIMIT dataset is intriguing and requires more sophisticated investigation. More research is needed here, particularly cross-corpora investigations to examine the generalization abilities of AFB filters, Mel filters, MFCC, and PLP; more accurate learning models may also help. There is also a small possibility of creating even more compact filter banks with fewer than 11 frequency bands. In future work, we will assess the effects of AFB filter banks on large-vocabulary continuous speech recognition and other areas of speech processing, such as emotion recognition, speaker identification, gender detection, and speaker diarization, with advanced deep network models.