1 Introduction

Automatic speaker verification (ASV) is a biometric task of verifying the claimed identity of a speaker. It uses the characteristics of the human voice/speech for authentication of the claimant. If the test speech sample, given at the time of verification, is close enough to the target model (template), the ASV system pronounces the claim to be genuine; otherwise, the speaker is declared an impostor. Compared to other competing biometric technologies, voice biometrics stands strong due to its prompt, hassle-free, and reliable authentication. It strengthens security by minimizing breaches caused by compromised or stolen passwords, phishing, fraud, etc. Needless to say, such cyber scams and frauds can play havoc with anyone who unknowingly accesses online application tools. Voice biometrics enables a system to spend less time authenticating users and resetting passwords. ASV technology provides a low-cost biometric solution [18] and is thus increasingly gaining acceptance in remote access to applications including, but not limited to, banking and financial services, websites and networks, telephone and internet transaction authentication, audio signatures for digital documents, hands-free mobile authentication, authentication during customer support calls, biometric login, payment gateways, merchandising, forensics, healthcare and mobile workforces, social networking websites, e-games, and e-learning tools. With the ever-growing need for surveillance and secure systems, ASV systems are destined to be ubiquitous, providing a much-needed security shield to adults and children alike. Regrettably, a major portion of the works reported in the literature deals with building ASV systems for the adult population; studies on building an ASV system for the more vulnerable group, i.e., children, remain scarce [25, 28, 32].

State-of-the-art ASV systems are found to be very effective, incurring minimal error, when they are fed with a sufficiently large amount of long-duration speech data. Unfortunately, most children's speech corpora are not easily available. Moreover, they exist for only a handful of the languages spoken across the globe and are limited in terms of hours of data. For languages in which a children's speech corpus is unavailable (the zero-resource condition), developing an ASV system is a formidable task. Even when a limited amount of children's speech data is available (the low-resource condition), developing an effective children's ASV system employing deep learning architectures remains very challenging. State-of-the-art ASV systems incorporate deep learning architectures that require estimating a huge number of parameters, which in turn requires a large amount of domain-specific data. To overcome the issues arising under low- and zero-resource conditions, a few earlier works on children's ASV have studied the impact of synthetically generating speech data and pooling it into training. Out-of-domain data augmentation has been reported to be effective in this regard [28].

An ASV system, in real-world applications, is also marred by the constrained duration of speech utterances. Though this requirement can be fulfilled during the training phase by data augmentation techniques, it is not feasible to do the same during the testing phase. In forensic applications, for instance, the employed ASV system is unlikely to get sufficient data even for enrollment. In access-control scenarios, the average utterance length is restricted to a few seconds only [17]. To the best of the authors' knowledge, there is hardly any work reported on the children's ASV task using short utterances. Motivated by this gap, the role of out-of-domain data augmentation techniques in the context of short utterance-based children's ASV is explored in this paper. The effect of synthetically generating speech data from an available adults' speech corpus, rendered acoustically similar to children's speech prior to augmentation, is analyzed in this study. The techniques explored in this paper to address the dearth of domain-specific data include (i) voice conversion (VC) of adults' speech data using a cycle-consistent generative adversarial network (C-GAN) [10], (ii) prosody modification (PM) [26, 27] of adults' speech, and (iii) up-scaling the formant frequencies (FM) [11, 14] of adults' speech data. All the explored techniques modify the attributes of adults' speech in order to render it acoustically similar to children's speech. The explored out-of-domain data augmentation approach is observed to be very effective, as demonstrated through the experimental studies presented in this paper.

In addition to data augmentation, the effectiveness of frame-level concatenation of the most popular front-end acoustic feature, namely the MFCC, with the GTF-CC or with the IGTF-CC is also examined in this paper. In general, the MFCC are the most commonly used front-end acoustic features capable of capturing speaker-specific characteristics. However, ASV systems based solely on MFCC features show susceptibility in a number of scenarios. Firstly, the performance of MFCC-based systems degrades drastically in the presence of background noise [6]. Secondly, the Mel-scale in the standard MFCC is not the optimal auditory model [6]. Lastly, since the resolution of the Mel-filter-bank decreases with increasing frequency, the performance of MFCC-based ASV systems degrades for high-pitched speakers. These facts have motivated the authors of this paper to delve into the role of another well-acclaimed front-end speech parameterization technique, namely the GTF-CC. Earlier works have demonstrated that GTF-CC performs robustly in speaker verification tasks in the presence of additive noise over a wide range of signal-to-noise ratios [6]. Further, the Gamma-tone filter-bank employed in the extraction of GTF-CC features exploits the advantages of the human auditory system [22], being closer to the human auditory model. The Gamma-tone filter-bank is very similar to the rounded exponential function used to represent the magnitude response of the human auditory filters [8]. The Mel-filter-bank, on the other hand, is designed to model the human pitch perception mechanism [9]. The feature fusion of MFCC and GTF-CC is thus expected to enhance the ASV performance. Against this backdrop, the authors have also explored a variant of the GTF-CC, namely the IGTF-CC. In this case, the employed filter-bank is obtained simply by flipping the Gamma-tone filter-bank around the midpoint of the frequency axis. The Inverse Gamma-tone filter-bank is thus expected to have better frequency resolution in the higher-frequency range, at the cost of reduced resolution for the lower-frequency components. As already highlighted, the Mel-filter-bank down-samples the spectral information in the higher-frequency range. Owing to the complementary nature of its filter-bank, the IGTF-CC is expected to better capture the acoustic information in children's speech that is otherwise averaged out by the MFCC features. The feature fusion of MFCC and IGTF-CC is thus expected to outperform the traditional MFCC, leading to an enhanced children's ASV system. To the best of the authors' knowledge, the role of IGTF-CC has not yet been explored in the context of children's speaker verification.

Fig. 1: Block diagram outlining the data augmentation and feature concatenation approaches proposed in this work in order to enhance the verification performance of a short utterance-based children's ASV system

The aforementioned proposal of feature concatenation in addition to data augmentation is outlined in Fig. 1 and validated in the experimental results section of the paper. The paper also presents age group-wise as well as gender-wise analyses of the children's ASV performance to unravel the effect of data augmentation and feature concatenation. The proposed approach considerably reduces the equal error rate (EER) and the detection cost function (DCF) compared to our baseline system trained exclusively on children's speech using MFCC features. The children's ASV systems developed in this work for experimental evaluation employ x-vector-based speaker representation along with probabilistic linear discriminant analysis (PLDA)-based scoring.

The noteworthy contributions of this study are delineated as follows:

  • A comprehensive examination of the children’s speaker verification task using short utterances under low-resource conditions. As emphasized, there is a notable scarcity of research addressing the children’s ASV task centered on short utterances;

  • The efficacy of the suggested data augmentation strategy in mitigating the adverse impact resulting from the scarcity of domain-specific data is illustrated;

  • The significance of the proposed frame-level concatenation of front-end acoustic features in preserving higher-frequency contents within children’s speech data is emphasized and substantiated;

  • An exhaustive examination of the children’s ASV system is conducted, categorizing subjects by age group and gender, to assess the cumulative effects of data augmentation and feature concatenation.

The rest of this paper is organized as follows: Sect. 2 describes the proposed out-of-domain data augmentation techniques to deal with the scarcity of domain-specific data. Sect. 3 discusses the motivation to look beyond the traditional Mel-based filter-bank and the scope of feature concatenation for the children's ASV system. The experimental evaluations exhibiting the efficacy of the proposed techniques are presented in Sect. 4. Finally, conclusions and the future scope of this research are presented in Sect. 5.

2 Proposed Out-of-Domain Data Augmentation Technique

The state-of-the-art ASV system makes use of x-vector-based speaker representation. For extracting x-vectors, a time-delay neural network (TDNN) [16, 30, 31] comprising a large number of hidden layers and hidden nodes per layer is trained. As already mentioned, one of the hurdles in the development of a reliable ASV system for children is the scarcity of domain-specific data. Hence, training an x-vector extractor on a limited amount of children's speech will result in sub-optimal performance. Out-of-domain data augmentation techniques can help mitigate this obstacle. However, it is worth highlighting that the augmented data must have attributes similar to those of children's speech; otherwise, the trained ASV system would fail to generalize well to unseen child speakers. Driven by this rationale, we present an out-of-domain data augmentation technique that is observed to be very effective in the context of the children's ASV task using short utterances.

The proposed out-of-domain data augmentation technique is pictorially summarized in Fig. 1. In our approach, we use a limited amount of adults’ speech for augmentation. As already mentioned, we have employed different techniques by which the acoustic attributes of adults’ speech can be suitably modified and those are briefly discussed in the following:

In the first technique, the adults' speech was subjected to voice conversion (VC) using a cycle-consistent generative adversarial network (C-GAN) [10]. Nearly 10 min of speech data from each speaker group (adult and child speakers) was used to train the C-GAN. As a result of VC, the adults' speech utterances sound very similar to children's speech, as noted during listening tests. Therefore, pooling the voice-converted data reduces the acoustic mismatch to a large extent.

In the second technique, the adults' speech was subjected to prosody modification prior to augmentation. It is well known that the pitch of children's speech is higher while their speaking rate is slower [14, 24]. Therefore, the pitch of the speech data from the adult speakers was increased by a factor of 1.35 while the duration was increased by a factor of 1.4. These scaling factors were determined from earlier works on children's speech recognition [26]. To perform prosody modification (PM), the technique reported in [21] was used. Again, pooling the prosody-modified data ensures that the acoustic mismatch remains in check.
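For illustration, the effect of these scaling factors can be approximated with off-the-shelf routines, as in the minimal sketch below. This is an assumption-laden stand-in: the actual prosody modification in this work follows the technique of [21], and the file names are illustrative:

```python
# Approximate prosody modification (pitch x1.35, duration x1.4) with librosa;
# the paper itself uses the method of [21], not librosa.
import numpy as np
import librosa
import soundfile as sf

PITCH_FACTOR = 1.35     # children's pitch is higher
DURATION_FACTOR = 1.4   # children's speaking rate is slower

def prosody_modify(in_wav, out_wav):
    y, sr = librosa.load(in_wav, sr=16000)
    n_steps = 12.0 * np.log2(PITCH_FACTOR)               # factor -> semitones
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    y = librosa.effects.time_stretch(y, rate=1.0 / DURATION_FACTOR)  # lengthen
    sf.write(out_wav, y, sr)

prosody_modify("adult_utt.wav", "adult_utt_pm.wav")   # illustrative paths
```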

In the case of child speakers, the formant frequencies are higher than those of adult speakers [14, 24]. Hence, in the third technique, the formant frequencies (FM) of the adults' speech data were up-scaled by 8% (a warping factor of 0.08). At the same time, the speaking rate of the adults' speech data was decreased by a factor of 1.4 through time-scale modification (TSM) [21] to compensate for the differences in speaking rate discussed earlier. The mentioned scaling factors were adopted from earlier works [13, 26]. As with VC and PM, pooling the formant-modified adults' data helps increase the amount of training data while keeping the acoustic mismatch in check to a large extent.
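As an illustration of the formant up-scaling step, the sketch below warps the WORLD spectral envelope along the frequency axis. The paper does not prescribe this tool chain, so pyworld and the interpolation-based warping are assumptions; the warping factor follows the text, and the accompanying slow-down by a factor of 1.4 can be applied separately, e.g., as in the prosody sketch above:

```python
# Formant up-scaling by warping the WORLD spectral envelope (assumed tooling).
import numpy as np
import pyworld as pw
import soundfile as sf

ALPHA = 0.08  # formants shifted up by 8%

def formant_upscale(in_wav, out_wav):
    x, fs = sf.read(in_wav)          # assumes 16 kHz mono input
    x = x.astype(np.float64)
    f0, sp, ap = pw.wav2world(x, fs) # pitch, spectral envelope, aperiodicity
    n_bins = sp.shape[1]
    bins = np.arange(n_bins)
    src = bins / (1.0 + ALPHA)       # S'(f) = S(f / (1 + alpha))
    warped = np.empty_like(sp)
    for t in range(sp.shape[0]):     # frequency-warp each frame's envelope
        warped[t] = np.interp(src, bins, sp[t])
    sf.write(out_wav, pw.synthesize(f0, warped, ap, fs), fs)
```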

All the modified versions of the adults' data are then pooled into training along with the children's speech as well as the original adults' data. Consequently, the amount of training data is increased, leading to a more robust estimation of the model parameters. Moreover, modifying the acoustic attributes ensures that the developed ASV system does not get biased toward adult speakers. It is worth mentioning that even though the aforementioned techniques for synthetically generating speech data are well acclaimed in the literature, their combined effectiveness in the context of children's ASV systems for short utterances is relatively uncharted.

3 Exploring the Role of Different Acoustic Features in Children ASV

3.1 Prior Art

As mentioned earlier, the MFCC features are among the most popular and commonly used front-end acoustic features in the context of an ASV system. They constitute the first of the three front-end features explored in this paper. The step-wise process of extracting MFCC features is briefly described as follows, with a minimal illustrative sketch given after the list:

  • The speech signal is first high-pass filtered through a pre-emphasis filter in order to emphasize the higher-frequency components;

  • Next, each of the speech utterances is analyzed into short-time frames using overlapping Hamming windows, followed by the computation of short-time Fourier transform (STFT);

  • Spectral warping is then carried out using a set of non-linearly spaced filters, called the Mel (melody) filter-bank. The Mel-filter-bank is a set of triangular Mel-weighted filters, as depicted in the top panel of Fig. 2;

  • Logarithmic compression of the filtered power spectrum is then performed;

  • The decorrelated real cepstrum (RC) is then obtained by applying discrete cosine transform (DCT);

  • Finally, by low-time liftering of the real cepstrum, the MFCC features are extracted, which are eventually fed as input for training the classifier.
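The following minimal NumPy/SciPy sketch mirrors the steps enumerated above (pre-emphasis, Hamming windowing, STFT, Mel warping, log compression, and DCT). In the actual experiments the features are extracted with the Kaldi toolkit (Sect. 4.2), so this is only an illustrative re-implementation with the same parameter values (25 ms frames, 10 ms shift, pre-emphasis 0.97, 30 filters, 30 coefficients):

```python
# Illustrative from-scratch MFCC extraction; Kaldi is used in practice.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filt, n_fft, fs):
    # Triangular filters with centre frequencies equally spaced on the Mel scale.
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(x, fs=16000, n_filt=30, n_ceps=30, n_fft=512):
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])               # pre-emphasis
    flen, fshift = int(0.025 * fs), int(0.010 * fs)          # 25 ms / 10 ms
    n_frames = 1 + max(0, (len(x) - flen) // fshift)
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(flen)                       # windowed frames
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2          # STFT power spectrum
    logfb = np.log(power @ mel_filterbank(n_filt, n_fft, fs).T + 1e-10)
    # DCT decorrelates; keeping the first n_ceps terms is the low-time liftering.
    return dct(logfb, type=2, axis=1, norm="ortho")[:, :n_ceps]
```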

Fig. 2: Configuration of the Mel-, Gamma-tone, and Inverse Gamma-tone filter-banks

Fig. 3: Block diagram outlining the process of extracting concatenated MFCC features with either GTF-CC or IGTF-CC features

Fig. 4: Spectrograms corresponding to speech data from adult male (top panel), adult female (middle panel), and child (bottom panel) speakers uttering the word HEED. The red speckles are the contours denoting the variation in formant frequencies, while the blue line denotes the pitch frequency variations (Color figure online)

3.2 Motivation for Exploring the Role of Feature Concatenation in Children ASV

MFCC features are the most conventional front-end acoustic features and have been the state of the art ever since their inception. They provide a compact and stable representation of the vocal tract of a speaker, significantly reducing the computational cost. The limitations of the MFCC features discussed in the previous section provide the necessary impetus toward the exploration of alternative front-end acoustic features, namely the GTF-CC and the IGTF-CC.

3.2.1 Frame-Level Concatenation of MFCC and GTF-CC Feature Vectors

The Gamma-tone filter-banks are well known to better model the human auditory system [6, 15]. The Gamma-tone filters have a smoother shape and are spaced according to an auditory (ERB-based) scale, in stark contrast with the Mel filters, as shown in the middle panel of Fig. 2. Moreover, the amount of overlap of the Mel-filter-bank is fixed, so if the number of filters increases, the bandwidth of each triangular filter decreases. On the other hand, the bandwidth of a Gamma-tone filter is determined by its center frequency, so if the number of filters increases, the overlap also increases. Theoretical and experimental results in [4] demonstrate that the filter bandwidth is one of the vital factors affecting speaker recognition performance in noise. Further, the authors in [6], with the help of spectrograms, analyzed the performance of an ASV system subjected to noisy speech utterances. The MFCC spectrogram was found to show robustness only at low frequencies, whereas the GTF-CC spectrogram showed robustness at both low and high frequencies, suggesting that GTF-CC features can play an influential role when dealing with child speakers. This paved the way for concatenating the MFCC and GTF-CC feature vectors to analyze the performance of the short utterance-based children's ASV system.
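For reference, the Gamma-tone filter is commonly characterized by the impulse response

$$\begin{aligned} g(t) = a\,t^{\,n-1}e^{-2\pi b t}\cos (2\pi f_{c} t + \phi ), \quad t \ge 0, \end{aligned}$$

where \(n\) is the filter order (typically 4), \(f_{c}\) the center frequency, \(\phi\) the phase, and the bandwidth parameter \(b\) is commonly tied to the equivalent rectangular bandwidth as \(b = 1.019\,\mathrm {ERB}(f_{c})\) with \(\mathrm {ERB}(f_{c}) = 24.7\,(4.37 f_{c}/1000 + 1)\). This dependence of \(b\) on \(f_{c}\) is precisely what makes the overlap grow as more filters are added.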

The block diagram outlining the extraction process of the concatenated MFCC and GTF-CC features is shown in Fig. 3. The GTF-CC features are extracted in exactly the same way as the MFCC features discussed earlier, the only replacement being the Gamma-tone filter-bank in place of the Mel-filter-bank. Both the MFCC and the GTF-CC features are extracted using the Kaldi toolkit. Given the speech signal, we first extract the MFCC and GTF-CC features. Next, for each short-time frame, the corresponding MFCC and GTF-CC features are appended. The resulting feature vectors (MFCC+GTF-CC concatenated at the frame level) are then used as the input to the x-vector extraction process instead of the MFCC features. The experimental evaluations presented later in this paper demonstrate that an ASV system trained on GTF-CC features concatenated with MFCC features performs better than one trained on MFCC features alone. However, it is worth highlighting that, due to the inherent nature of the filter-banks used in the concatenation of MFCC and GTF-CC, the spectral information in the higher-frequency range of children's speech is still down-sampled. The quest for preserving the higher-frequency contents of children's speech led us toward another front-end acoustic feature, namely the IGTF-CC.
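A minimal sketch of this frame-level fusion is given below; the arrays stand in for per-utterance Kaldi feature matrices, with dimensions following the 30-dimensional configuration of Sect. 4.2:

```python
# Frame-level concatenation of two per-utterance feature matrices.
import numpy as np

def concat_features(mfcc: np.ndarray, gtfcc: np.ndarray) -> np.ndarray:
    """Append 30-dim GTF-CC to 30-dim MFCC frame by frame -> 60-dim vectors."""
    assert mfcc.shape[0] == gtfcc.shape[0], "frame counts must match"
    return np.hstack([mfcc, gtfcc])

# e.g., a 300-frame utterance: (300, 30) + (300, 30) -> (300, 60)
fused = concat_features(np.random.randn(300, 30), np.random.randn(300, 30))
print(fused.shape)  # (300, 60)
```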

3.2.2 Frame-Level Concatenation of MFCC and IGTF-CC Feature Vectors

As mentioned earlier, a significant amount of germane spectral information is present in the higher-frequency region in the case of children. The spectrogram corresponding to children's speech, in the bottom panel of Fig. 4, shows significant power even in the 4–8 kHz range. Furthermore, earlier works suggest that the formant frequencies are up-scaled in the case of child speakers [7, 12], which is also quite prominently visible in the spectrogram of children's speech in the bottom panel of Fig. 4. The spectrograms corresponding to speech data from an adult male (top panel) and an adult female (middle panel) are also plotted for comparison in Fig. 4. Mel-scale warping is inspired by findings of psychoacoustics; it is based on the premise that human perception of pitch is linear up to 1000 Hz and becomes nonlinear (roughly logarithmic) for higher frequencies [3]. The Mel-filter-bank provides better resolution in the low-frequency range, while its frequency resolution deteriorates in the high-frequency range, as illustrated by the shape of its filters in the top panel of Fig. 2. When dealing with children's speech, this down-sampling of spectral information in the high-frequency range is a pitfall [7, 24]. Thus, the preservation of spectral information in the higher-frequency range, together with the pursuit of filter-banks that best describe the human auditory system, becomes a top priority and persuades us to look beyond the traditional Mel-based filter-bank and the Gamma-tone filter-bank for our high-pitched speakers. Motivated by this, the role of the Inverse Gamma-tone filter-bank is explored in this paper for the development of a robust children's ASV system.

The Inverse Gamma-tone filter-bank is realized simply by flipping the Gamma-tone filters about the midpoint of the frequency axis, as depicted in the bottom panel of Fig. 2. The front-end acoustic features obtained by replacing the Mel-filter-bank with the Inverse Gamma-tone filter-bank are referred to as IGTF-CC features. This configuration of the filter-bank results in a better resolution of the spectral information in the high-frequency region, and thus the Inverse Gamma-tone filter-bank is expected to capture the acoustic information missed by the MFCC features. It is worth highlighting that the Inverse Gamma-tone filter-bank is just a variant of the Gamma-tone filter-bank, implying that its filters have the same smooth shape and bandwidths likewise determined by their center frequencies; hence, if the number of filters increases, the overlap also increases. The Inverse Gamma-tone filter-bank, however, offers poorer resolution for the lower-frequency components. Therefore, we have conceived the idea of concatenating the MFCC and IGTF-CC feature vectors in order to effectively preserve both the low- and high-frequency components. The block diagram outlining the extraction process of the concatenated MFCC and IGTF-CC features is also represented in Fig. 3. The IGTF-CC features are extracted in exactly the same way as the MFCC features discussed earlier, the only replacement being the Inverse Gamma-tone filter-bank in place of the Mel-filter-bank. Both the MFCC and the IGTF-CC features are extracted using the Kaldi toolkit. Given the fully augmented speech data (employing the proposed out-of-domain augmentation technique), we first extract the MFCC and IGTF-CC features. Next, for each short-time frame, the corresponding MFCC and IGTF-CC features are appended. The resulting feature vectors (MFCC+IGTF-CC concatenated at the frame level) are then used as the input to the x-vector extraction process instead of the MFCC features. The experimental evaluations presented later in this paper demonstrate that an ASV system trained on IGTF-CC features concatenated with MFCC features outperforms not only the ASV system trained on MFCC features alone, but also the one trained on the frame-level fusion of MFCC and GTF-CC feature vectors.
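A minimal sketch of this construction is shown below. The Gamma-tone magnitude responses are approximated by the standard \((1 + ((f - f_{c})/b)^{2})^{-n/2}\) form with ERB-spaced center frequencies, and the inverse bank is obtained by mirroring the filter-bank matrix along the frequency axis. The exact filter-bank design used alongside Kaldi may differ, so the details below are assumptions for illustration:

```python
# Sketch: Gamma-tone filter-bank and its "inverse" obtained by flipping.
import numpy as np

def erb(f):
    # Glasberg & Moore equivalent rectangular bandwidth (Hz).
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_fb(n_filt=30, n_bins=257, fs=16000, order=4):
    freqs = np.linspace(0.0, fs / 2.0, n_bins)
    # Centre frequencies equally spaced on the ERB-rate scale (50 Hz..Nyquist).
    e_lo = 21.4 * np.log10(4.37 * 50.0 / 1000.0 + 1.0)
    e_hi = 21.4 * np.log10(4.37 * (fs / 2.0) / 1000.0 + 1.0)
    cfs = (10.0 ** (np.linspace(e_lo, e_hi, n_filt) / 21.4) - 1.0) * 1000.0 / 4.37
    fb = np.empty((n_filt, n_bins))
    for i, fc in enumerate(cfs):
        b = 1.019 * erb(fc)  # bandwidth grows with centre frequency
        fb[i] = (1.0 + ((freqs - fc) / b) ** 2) ** (-order / 2.0)
    return fb

# Inverse Gamma-tone filter-bank: mirror each response about the midpoint
# of the frequency axis, so the narrow, densely packed filters originally
# at low frequencies now resolve the high-frequency region.
igt_fb = gammatone_fb()[:, ::-1]
```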

Fig. 5: Canonical correlation analysis of the various feature concatenations explored in the paper

3.3 Canonical Correlation Analysis (CCA)

In order to substantiate the effect of feature concatenation, canonical correlation analysis (CCA) was carried out. We computed the canonical correlations among the MFCC, GTF-CC, and IGTF-CC features, as shown in Fig. 5. The CCA plot in the top panel of Fig. 5 shows how closely the MFCC and GTF-CC feature vectors are correlated for the majority of the coefficients, barring the last few. This explains the limited ability of the concatenated MFCC and GTF-CC features to capture the diverse range of acoustic attributes in children's speech. The CCA plot in the bottom panel of Fig. 5 shows that the MFCC and IGTF-CC feature vectors are largely uncorrelated or only weakly correlated for most of the coefficients, barring the first few. Therefore, the frame-level concatenation of MFCC and IGTF-CC features captures a wider range of acoustic attributes. The inherently different configurations of the filter-banks employed in the extraction of MFCC and IGTF-CC features are the main reason behind this behavior. Thus, the CCA plot of IGTF-CC and MFCC reinforces the complementary characteristic of IGTF-CC with respect to MFCC, which helps the pair better capture the acoustic information in children's speech.
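A minimal sketch of the CCA computation is given below, using scikit-learn; the random matrices merely stand in for frame-aligned feature matrices such as MFCC and IGTF-CC:

```python
# Per-component canonical correlations between two feature streams.
import numpy as np
from sklearn.cross_decomposition import CCA

def canonical_correlations(feats_a, feats_b, n_components=30):
    cca = CCA(n_components=n_components, max_iter=1000)
    a_c, b_c = cca.fit_transform(feats_a, feats_b)   # canonical variates
    return np.array([np.corrcoef(a_c[:, k], b_c[:, k])[0, 1]
                     for k in range(n_components)])

# Placeholders for [n_frames, 30] MFCC and IGTF-CC matrices.
rho = canonical_correlations(np.random.randn(5000, 30),
                             np.random.randn(5000, 30))
```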

4 Experimental Evaluations

In this section, the relative effectiveness of the MFCC, the concatenated MFCC and GTF-CC, and the concatenated MFCC and IGTF-CC features is explored, and the experimental results are presented.

4.1 The Speech Corpora

Four different speech corpora were employed for the development and evaluation of the speaker verification system for children. These English speech corpora include the CSLU kids corpus [29], the CMU kids corpus [5], the PF-STAR kids corpus [1], and the WSJCAM0 corpus [23]. The details of each of these data sets are succinctly summarized below and tabulated in Table 1:

  1. CSLU kids corpus: This data set consists of spontaneous and prompted speech, comprising 100 h of data with 73,100 utterances from 1100 children spanning kindergarten through grade 10. The speech data are sampled at 16 kHz. This corpus is used as the training data for the ASV system in this work.

  2. CMU kids corpus: This data set comprises 9.1 h of data with 5180 utterances from 76 children aged 6–11 years. The sampling rate of this corpus is also 16 kHz, and it serves as our test set. A total of 423,388 genuine trials and 26,403,832 impostor trials are present in this data set. The average utterance duration in this corpus is 6 s; evaluation on this set therefore represents the short-utterance case.

  3. PF-STAR kids corpus: This data set comprises 8.3 h of data with utterances from 121 child speakers aged 4–14 years. To maintain a uniform sampling rate with the rest of the corpora, this data set was down-sampled from 44.1 to 16 kHz. It is our test set for long utterances, comprising 6,664 genuine trials and 995,420 impostor trials. The average utterance duration in this corpus is 30 s; evaluation on this set serves as a contrast to demonstrate the severity of the problem when short utterances are used.

  4. WSJCAM0 corpus: This adults' speech corpus is used for the out-of-domain data augmentation. It comprises 15.5 h of unperturbed data with 7852 utterances and 132,778 words from 92 adult speakers (male and female), sampled at 16 kHz.

Table 1 Details of different data sets used in this work for training and testing phase of the children’s ASV system

4.2 Experimental Setup

The entire setup of the ASV system was developed and examined using the Kaldi toolkit [19]. To extract the three aforementioned kinds of front-end features, the speech data were first high-pass filtered with a pre-emphasis factor of 0.97. Since speech data are non-stationary in nature, each speech utterance was then analyzed into short-time frames using overlapping Hamming windows of 25 ms duration with a frame shift of 10 ms. For each of the three front-end features, a 30-channel filter-bank was employed to extract 30-dimensional base features. For the MFCC features, a 30-channel Mel-filter-bank was employed to warp the power spectra to the Mel-scale before computing the 30-dimensional MFCC features. For the GTF-CC features, a 30-channel Gamma-tone filter-bank was employed to warp the power spectra before computing the 30-dimensional GTF-CC features. Finally, for the IGTF-CC features, a 30-channel Inverse Gamma-tone filter-bank was applied to the power spectrum before computing the 30-dimensional IGTF-CC features.

Description of out-of-domain data augmentation: The out-of-domain training set used for developing the children's ASV system was derived from an adults' speech corpus, the WSJCAM0 corpus. This training data set consists of original adult speech data from both male and female speakers and is referred to as ADULT. Three new versions of speech data were synthetically generated from this corpus, as listed below:

  i. ADULT-VC: This data set was generated by applying voice conversion to the adult data through a cycle-consistent generative adversarial network (C-GAN). The C-GAN was trained on a 10-min speech data set encompassing both adult (source) and child (target) speakers, with the number of training epochs set at 5000;

  ii. ADULT-PM: This data set was generated by increasing the duration of the ADULT speech data by a factor of 1.4 while raising its pitch by a factor of 1.35. Time-scale modification was performed via audio stretching, leveraging the fuzzy classification of spectral bins (FCSB) methodology [2];

  iii. ADULT-FM-TSM: This data set was generated by up-scaling the formant frequencies of the ADULT speech by 8% (a warping factor of 0.08). At the same time, the speaking rate of the adults' speech data was decreased by a factor of 1.4 through time-scale modification.

After performing the aforementioned data modification techniques, namely voice conversion (VC), prosody modification (PM), and formant and time-scale modification (FM-TSM), a total of 63 h of synthetic data with acoustic attributes similar to those of children's speech was available for training.

For the extraction of highly discriminative speaker representations, a deep neural network was utilized. These fixed-dimensional speaker embeddings, called x-vectors, were extracted from a time-delay neural network (TDNN) architecture [16, 30, 31]. The architecture consists of 7 hidden layers and was trained for 6 epochs. The TDNN is structured into three integral components: the frame-level, statistics-level, and segment-level components. Within the frame-level component, spanning layers 1–5, input features sequentially traverse the layers, effectively capturing temporal information and enlarging the temporal context of the frames under consideration. The statistics-level component converts variable-length speech inputs into a single fixed-dimensional vector; it consists of one layer, called statistics pooling, which aggregates the output vectors of the frame-level component and computes their mean and standard deviation. The segment-level component, in turn, attributes speaker identities to the segment-level vector: the concatenated mean and standard deviation are passed through two additional hidden layers, followed by a softmax output layer. Layer 6 serves as the speaker embedding layer, transforming the information from the preceding layer into a low-dimensional representation. The network parameters were trained using the stochastic natural gradient descent algorithm [20, 31]. Finally, each speech utterance was represented by a 512-dimensional x-vector.

The scoring was performed using the x-vectors in conjunction with a trained PLDA model. Given two per-utterance embeddings, denoted \(e_{i}\) and \(e_{j}\), the PLDA model computes a log-likelihood ratio (LLR) quantifying the likelihood associated with the pair of embeddings. The LLR is calculated as:

$$\begin{aligned} \text {LLR}(e_{i},e_{j}) = \log \left[ \frac{ P\left( e_{i},e_{j} \mid H_{1}\right) }{P \left( e_{i},e_{j} \mid H_{0}\right) } \right] \end{aligned}$$
(1)

where \(H_{1}\) represents the same-speaker hypothesis, while \(H_{0}\) represents the different-speaker hypothesis. The PLDA model calculates a log-likelihood ratio for each trial, representing the degree of similarity between the two utterances. When the pair shares the same label, a high score is anticipated, signifying identical speakers (a genuine claim); conversely, when the pair bears different labels, a low score is expected, indicating different speakers (an impostor). The metrics used for performance measurement were the equal error rate (EER) and the minimum decision cost function (minDCF).
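To make the TDNN pipeline described above more concrete, the following condensed PyTorch sketch outlines the x-vector network; the layer widths and context sizes follow common x-vector recipes [31] and are assumptions, since the experiments themselves use the Kaldi implementation:

```python
# Condensed x-vector TDNN sketch (widths follow common recipes, not
# necessarily the exact Kaldi configuration used in the paper).
import torch
import torch.nn as nn

class XVectorTDNN(nn.Module):
    def __init__(self, feat_dim=60, n_speakers=1000):
        super().__init__()
        def tdnn(i, o, k, d):  # 1-D conv acts as a time-delay layer
            return nn.Sequential(nn.Conv1d(i, o, k, dilation=d),
                                 nn.ReLU(), nn.BatchNorm1d(o))
        # Frame-level component (layers 1-5): widens the temporal context.
        self.frame = nn.Sequential(
            tdnn(feat_dim, 512, 5, 1), tdnn(512, 512, 3, 2),
            tdnn(512, 512, 3, 3), tdnn(512, 512, 1, 1),
            tdnn(512, 1500, 1, 1))
        self.seg6 = nn.Linear(3000, 512)  # layer 6: the 512-dim x-vector
        self.seg7 = nn.Sequential(nn.ReLU(), nn.Linear(512, 512), nn.ReLU())
        self.out = nn.Linear(512, n_speakers)  # softmax output over speakers

    def forward(self, x):  # x: (batch, feat_dim, n_frames)
        h = self.frame(x)
        # Statistics pooling: per-utterance mean and standard deviation.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvec = self.seg6(stats)  # extracted as the speaker embedding
        return self.out(self.seg7(xvec)), xvec

model = XVectorTDNN(feat_dim=60)  # 60-dim concatenated MFCC+(I)GTF-CC input
logits, xvec = model(torch.randn(4, 60, 300))  # 4 utterances, 300 frames
```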

4.3 Experimental Results

This study was carried out to monitor how the performance of an ASV system, trained on a mix of a large amount of children's speech data and an adequate amount of original as well as modified adults' speech, is affected when subjected to short utterances of children's speech. The EER and minDCF values for the employed ASV system are given in Table 2. When subjected to short utterances, a relative improvement of \(33.6\%\) over the baseline system trained solely on the child data set is achieved when the proposed data augmentation techniques are applied, showing that the proposed data augmentation is very effective. The EER and minDCF values obtained when the employed ASV system is tested on long utterances of the children's speech test set are also listed for comparison. As can be seen from Table 2, the EER of the baseline system climbs from \(6.38\%\) (for long test utterances) to \(21.95\%\) (for short test utterances). Further, when the proposed out-of-domain data augmentation is employed, the EER climbs from \(3.824\%\) (for long test utterances) to \(14.58\%\) (for short test utterances). This shows the magnitude of the challenge posed by short test utterances. At the same time, it is worth appreciating that the proposed data augmentation approach takes the edge off the detrimental effect of the short-utterance test set.
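For reference, the EER reported throughout can be computed from a set of trial scores as in the following minimal sketch (scikit-learn ROC; the synthetic scores are placeholders for PLDA LLRs of genuine and impostor trials):

```python
# EER: operating point where false-acceptance and false-rejection rates meet.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """labels: 1 for genuine trials, 0 for impostor trials."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    k = np.nanargmin(np.abs(fnr - fpr))   # closest crossing point
    return (fnr[k] + fpr[k]) / 2.0

scores = np.concatenate([np.random.randn(1000) + 1.0, np.random.randn(5000)])
labels = np.concatenate([np.ones(1000), np.zeros(5000)])
print(f"EER = {100 * equal_error_rate(scores, labels):.2f}%")
```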

Next, the effectiveness of the frame-level concatenation of the front-end acoustic features for the employed short utterance-based ASV system was examined. The EER and minDCF values obtained when MFCC and GTF-CC features are concatenated, as well as those for the MFCC and IGTF-CC feature fusion, are given in Table 3. In this case, the proposed data augmentation technique was employed prior to training the ASV system. The EER and minDCF values obtained when MFCC features are used alone are also listed for comparison. As evident, an absolute reduction of \(0.62\%\) in EER is achieved by concatenating MFCC with GTF-CC features, while the concatenation of MFCC and IGTF-CC features yields an absolute reduction in EER of \(1.08\%\). The detection error trade-off (DET) plot summarizing these results is shown in Fig. 6. In this plot, the baseline refers to the ASV system trained exclusively on children's speech using MFCC features.

Table 2 EER and minDCF values for the short and long utterances of children’s speech test set demonstrating the effectiveness of out-of-domain data augmentation techniques
Table 3 EER and minDCF values for the short utterance-based ASV system trained on the data set obtained using the proposed out-of-domain data augmentation technique demonstrating the effectiveness of feature concatenation
Fig. 6: Detection error trade-off plot demonstrating the effectiveness of the proposed feature concatenation

To assess the effectiveness of the proposed approach more comprehensively, age-wise and gender-wise analyses of the children's speech were performed. For evaluating the effect of age variation, the evaluation metrics are reported for the entire test set as well as after splitting the test set age-wise into two subgroups. The EER and minDCF values for this experimental study are given in Table 4. In this case as well, the proposed out-of-domain data augmentation was employed before training the x-vector extractor. The first subgroup comprised speech utterances from speakers in the age group of 6–7 years, while the second comprised speech utterances from speakers in the age group of 8–9 years. As evident from the listed results, a significant degradation (reflected in the poor EER values) is noted for children in the lower age group (6–7 years) compared to children in the higher age group (8–9 years) or to the full test set. This degradation is attributed to the higher formant and pitch frequencies of younger children's speech, owing to their inherently shorter vocal tracts. As children grow, their formant frequencies decrease and their speaking rate stabilizes.

Table 4 Age group-wise break up of EER and minimum DCF values highlighting the significance of feature concatenation approaches

Further, it is noteworthy that the ASV system trained solely on the MFCC features performs poorly in terms of the evaluation metrics, since the Mel-filter-bank down-samples the higher-frequency contents of children's speech. The children's ASV system trained on the concatenated acoustic features yields better results, and this improvement is more profound in the lower age group, as the concatenation of GTF-CC/IGTF-CC features with the MFCC features accounts for the spectral information in the lower- as well as the higher-frequency regions. The EER for the full test set shows a relative reduction of \(7.41\%\) when MFCC and IGTF-CC features are concatenated, depicted by the first bar in Fig. 7. The corresponding relative reduction for the 6–7 years age group is \(11.32\%\), depicted by the second bar in Fig. 7, while for the 8–9 years age group it is \(8.26\%\), depicted by the third bar in Fig. 7.

Fig. 7: Bar graph representation of the relative reduction in EER (%) for various speaker groups (in terms of age and gender) corresponding to the ASV system trained on the concatenated MFCC and IGTF-CC features, as compared to an ASV system trained on the MFCC features alone. The bar depicted in red shows the greatest relative improvement in EER

Table 5 Gender-wise breakup of EER and minimum DCF values highlighting the significance of feature concatenation approaches

Finally, the effect of gender-wise grouping of the children's speech test set on the performance of the employed ASV system was evaluated. The EER and minDCF values for this experimental study are given in Table 5. As evident from the listed results, a significant degradation (reflected in the poor EER values) is noted for female children as compared to male children or to the full test set. This degradation is due to the higher formant frequencies of female children's speech compared to their male counterparts. Further, as evident from the table, the EER reduces considerably when either GTF-CC or IGTF-CC features are concatenated with the MFCC features. The EER for the full test set shows a relative reduction of \(7.41\%\) when MFCC and IGTF-CC features are concatenated, depicted by the first bar in Fig. 7. The corresponding relative reduction for female children is \(7.13\%\), depicted by the fourth bar in Fig. 7, while for male children it is \(7.66\%\), depicted by the fifth bar in Fig. 7.

5 Conclusion and Future Research Direction

The work in this paper sets forth our endeavor toward the development of a robust children's ASV system using short utterances under low-resource conditions. To address the inevitable problem of speech data paucity, an out-of-domain data augmentation technique is proposed to synthetically generate more training data. The out-of-domain data augmentation approach helps widen the diversity of the captured acoustic attributes by introducing missing desirable characteristics while keeping the acoustic mismatch in check. Together with data augmentation, the effectiveness of frame-level concatenation of MFCC with the GTF-CC/IGTF-CC features is also analyzed in this paper. The GTF-CC features and their variant, the IGTF-CC features, are well known to better model the human auditory system and are more resilient to additive noise than the traditional MFCC features. Additionally, the complementary nature of the filter-bank in the IGTF-CC with respect to MFCC helps preserve spectral information in the higher-frequency range. Thus, MFCC features in tandem with the IGTF-CC features help not only in modeling the human auditory system in a more competent manner, but also in preserving the spectral information in the low- as well as high-frequency ranges. Furthermore, age- and gender-wise analyses were carried out to study the combined effect of data augmentation and feature concatenation on the ASV system performance. Children in the lower age bracket exhibit more pronounced inter-speaker variability, resulting in degraded performance in terms of EER and minDCF compared to children in the higher age bracket. At the same time, the employed ASV system incorporating both the proposed data augmentation technique and feature concatenation is found to be more impactful for children in the lower age group, resulting in a significant reduction in EER and minDCF compared to the baseline.

As a future extension of this work, in addition to the out-of-domain data derived from adults' speech, we would like to explore the effectiveness of in-domain data augmentation techniques for increasing the amount and diversity of the captured acoustic attributes of children's speech for training. In-domain data augmentation refers to increasing the amount of children's speech available for training by synthetically generating more data from the children's speech itself. In this regard, we would like to implement speed perturbation and pitch perturbation of the original children's speech. In addition, we would also like to explore and incorporate the vocal tract length perturbation (VTLP) technique. The VTLP approach explicitly models and compensates for the ill effects of variations in vocal tract length by creating multiple copies of the children's speech data with different linear warping factors, thereby introducing diversity into the training set. The out-of-domain data augmentation techniques in tandem with the in-domain techniques are anticipated to further reduce the EER and minDCF values, eventually helping realize a more robust and dependable children's ASV system.