1 Introduction

Automatic speaker recognition (ASR) is the approach used to automatically authenticate the identity of a speaker by exploiting speaker-specific characteristics embedded in the speech signal, enabling access for various applications [1, 2]. Speaker recognition is divided into two sub-areas: speaker identification and speaker verification. Speaker identification is the task of determining which of the enrolled speakers delivered a given utterance, whereas speaker verification is the task of accepting or rejecting a speaker’s claimed identity. Furthermore, depending on the type of speech, speaker identification systems are classified as text-dependent (predefined passwords) or text-independent (unconstrained passwords). Text-dependent speaker identification (TD-SI) models are typically template-based or sequence-matching strategies in which the recognition passages are preset or available in advance [3]. However, preset phrases are not practical in many applications, notably forensic investigations and surveillance. Moreover, humans can distinguish speakers regardless of the content of the speech. Therefore, text-independent speaker identification (TI-SI) models have gained attention, as it is more difficult to imitate an unknown utterance [4].

Social media platforms are especially popular among teens and young adults, with over half of youngsters aged 8–17 using the internet and maintaining accounts on social media websites [5]. However, several speech technology applications have difficulty identifying non-native children owing to the child’s immaturity in cognition, phonology, and physiological development (vocal tract) [6, 7]. In particular, children’s voices have a higher pitch, and perceptual elements such as formants appear at higher frequencies [8]. Although significant research has focused on the automated identification of adult speakers, the concerns and procedures relevant to speaker identification in non-native children remain elusive and exposed to numerous external risks.

The most widely used feature engineering techniques in speech applications are the conventional mel frequency cepstral coefficients (MFCC) and linear predictive cepstral coefficients (LPCC) [9]. In the back end, a discrete cosine transform (DCT) reduces redundancy in the speech signal and compacts it into low-dimensional feature coefficients; however, it is limited to a fixed resolution. In contrast, the multi-resolution (multi-scale) analysis offered by the wavelet scattering transform (WST) compensates for the loss of high-frequency information caused by these feature extraction techniques [10]. The key contribution of this study is to identify non-native child speakers in both text-dependent (TD) and text-independent (TI) speech tasks, particularly with many classes, since model performance typically drops as the speaker population grows.

The remainder of the article is organized as follows: Sect. 2 provides a comprehensive insight into recent state-of-the-art models in child speaker identification. The proposed framework is presented in Sect. 3, including details of the speech corpus, wavelet scattered feature extraction, and the Bi-LSTM model for multi-class classification. The experimental setup, evaluation metrics, and reported outcomes of non-native child speaker identification in both TD and TI speech tasks are given in Sect. 4. Lastly, Sect. 5 concludes the paper and outlines future work.

2 Related work

Humans consider a variety of cues to identify a speaker, notably voice pitch, a particular laughter, and frequently used words [11]. However, the speaker identification systems built so far rely on short-term characteristics of the speech spectrum. Although considerable progress has been made on para-linguistic analysis (age, gender, language, nativity, emotion, identity) for adult speaker identification [12,13,14], comparatively little research has been done on children’s speech in this field, because para-linguistic analysis of children involves significant challenges. These include high spectral and temporal randomness due to “the onset of puberty” [15], immature vocal tracts, the development of fine motor skills during language acquisition, and higher fundamental frequencies than those of adults [16]. Children’s speech technology is further governed by nativity and accent, as distinguished by para-linguistic characteristics [17]. In the literature, modern speaker identification systems commonly deploy MFCC features trained with Gaussian mixture model (GMM) based methods to identify the speaker [2]. This well-known approach was applied to speaker recognition on children’s speech, with the OGI Kids corpus used to train Gaussian mixture model-universal background model (GMM-UBM) and Gaussian mixture model-support vector machine (GMM-SVM) models [18], obtaining accuracies of 90–99% for children between the ages of 5 and 16. Employing GMM-UBM systems, work on children’s speech has continued with the identification of child speakers by gender and age group [19]. Recently, a study addressed children’s automatic speaker verification (ASV) under low-resource conditions using i-vector, x-vector, and probabilistic linear discriminant analysis models harnessed with audio augmentation techniques [20]. The success of wavelet scattered feature extraction integrated with a convolutional neural network (CNN) for end-to-end adult speaker identification [21, 22] inspired us to apply a scattering network integrated with a recurrent neural network (RNN) to the proposed non-native child speaker identification system. The wavelet scattering network (WSN) is used as a feature extractor in this study on TI-SI, overcoming the limitations of MFCCs. A related approach found that training a Bi-LSTM on a small amount of TI speech data was successful in a large-scale speaker identification system [23].

Although speaker identification systems are effective in situations where there are few classes, as the number of classes grows, it becomes more difficult to identify the speaker, particularly in children who might not be native speakers. Therefore, the main objective of this work is to develop accurate methods for identifying non-native child speakers using deep RNNs and to evaluate the significance of information in filter coefficients of wavelet scattering networks.

3 Proposed system overview

This section presents a comprehensive procedure for identifying non-native child speakers across multiple classes in TD and TI speech tasks, making it the first article to use wavelet scattering transforms on children’s speech. Raw audio files are first transformed into second-order wavelet scattering feature coefficients using the wavelet scattering network, which serves as a feature extractor. A Bi-LSTM recurrent network is then trained on these features to model the sequential structure and capture long-term dependencies in the inputs [24]. After training the Bi-LSTM model on the wavelet feature coefficients extracted from the raw audio files, the model is used to predict the speakers in a separate test set, as shown in Algorithm 1. Predicting a speaker involves matching the test features against the trained model, as shown in Fig. 1: the features are fed into the Bi-LSTM model to obtain the corresponding sequence of hidden states, which are passed to a softmax classifier, and each test feature is assigned to the class with the highest predicted probability based on the softmax output of the Bi-LSTM model.

Algorithm 1

3.1 Dataset

A total of 20 children, 11 females and 9 males aged 7–12 years, were considered for data collection. All of the children are non-native English speakers whose native language is Telugu, an Indian regional language, spoken in addition to English. The data comprise two types of speech: TD speech and TI speech. In the TD speech task, the children repeat preset phrases presented on the screen (e.g., “Everyone in my school likes icecream”). In the TI speech task, the children narrate a picture (e.g., a pencil) shown on the screen in their own words (e.g., “This is a pencil. I will write my homework”).

The dataset consists of around 3.32 h of children’s speech with 2898 recordings, 2173 from the TD speech task and 725 from the TI speech task. The TI speech data are fewer than the TD speech data because children were more comfortable reading the text presented on screen than narrating a picture during data collection. This is because the non-native children use English only as a medium of instruction, and the emergence of Covid-19 has severely hampered children’s learning and development of L2 proficiency [25]. All audio files are stored in the .wav format as stereo recordings with a 44.1 kHz sampling rate and 16 bits per sample. Environmental sounds such as doors closing, bells ringing, and other children chatting in adjacent classrooms are also present, because the recordings of the non-native children were acquired during class days. A detailed description of the database is provided in the Non-Native Children Speech Mini Corpus [26].
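
As a note on data handling, a recording in this format can be loaded in MATLAB as sketched below; the file name is hypothetical and the stereo-to-mono down-mix is an assumed preprocessing step that is not described in the corpus documentation [26].

```matlab
% Illustrative loading of one corpus recording (stereo, 44.1 kHz, 16-bit .wav);
% the file name is hypothetical and the mono down-mix is an assumed preprocessing step.
[x, fs] = audioread('speaker_F01_utt_001.wav');   % hypothetical file name
x = mean(x, 2);                                   % average the two channels to mono
```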

Fig. 1 Proposed child speaker identification system using multi-scale recurrent neural networks

3.2 Wavelet scattering transform

In natural settings, speech signals produced by children often exhibit gradual changes or fluctuations that are occasionally interrupted by abrupt transients [27]. In such instances, the wavelet scattering transform (WST) is employed, as shown in Eq. (1); it can reflect the local features of signals in both the time and frequency domains, providing multi-resolution analysis [28, 29]. The WST is essentially a cascade of convolutions with a series of band-pass filters (a wavelet filter bank), a non-linearity (modulus), and pooling [30]. Each layer performs both linear and non-linear operations: first, the input signal is convolved with the specified complex wavelet filters (the linear operation), then the modulus non-linearity is applied, and finally a low-pass filter computes the local average (pooling).

$$\begin{aligned} S_{W}(\alpha ,\beta )= \frac{1}{\sqrt{\alpha }}\int _{-\infty }^\infty {\widehat{\varphi }}\left( \frac{t-\beta }{\alpha }\right) s(t)\textrm{d}t \end{aligned}$$
(1)

where \(S_{W}(\alpha ,\beta )\) is the WST of s(t), which is decomposed into its basis functions, with \(\alpha\) the binary or dyadic scale, \(\beta\) the binary or dyadic shift, and \({\widehat{\varphi }}(.)\) the complex conjugate of the wavelet function [31].

The first-order scattering coefficients are mathematically written as Eq. (2):

$$\begin{aligned} W_I[\lambda _1]S=|S*\varphi _{\lambda _1}|*\rho _I(t) \end{aligned}$$
(2)

where S is the input speech, \(\varphi _{\lambda _1}(t)\) denotes the dilated wavelet at center frequency \({\lambda _1}\) of the first-order bandpass filter, \(\rho _I(t)\) is the averaging low-pass filter, and * indicates the convolution operator. The local averaging by \(\rho _I(t)\) maintains better stability; however, it results in a loss of high-frequency information. The wavelet modulus is therefore decomposed further in the second-order scattering, as shown in Eq. (3), in order to recover this high-frequency information.

$$\begin{aligned} W_I[\lambda _1,\lambda _2]S=||S*\varphi _{\lambda _1}|*\varphi _{\lambda _2}|*\rho _I(t) \end{aligned}$$
(3)
Fig. 2 Decomposition using the wavelet scattering network. Red denotes the non-linear modulus after convolution with a series of band-pass filters at center frequency (\(\lambda\)); blue denotes outputs obtained via the low-pass filter (\(\rho _I\))

As with the first-order coefficients, these coefficients are averaged using a low-pass filter \(\rho _I\) of length T, which provides local invariance to time shifts.

Similarly, in the \(m^{th}\) layer, the sequence of n pre-defined wavelet paths \(\varphi _{\lambda _1},\varphi _{\lambda _2} \dots \varphi _{\lambda _m}\) is convolved with the coefficients from the \((m-1)^{th}\) layer, which correspond to the prior wavelet \({\varphi _{\lambda _{m-1}}}\). Figure 2 depicts an example of the WST for input speech with scattering order M = 4 and scale i = 4, where \(m = 0, m = 1, m = 2\), and \(m = 3\) represent the filter bank orders.
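
To make Eqs. (2) and (3) concrete, the following minimal MATLAB sketch traces a single second-order scattering path. The band-pass wavelets psi1, psi2 and the low-pass filter phi below are crude Gabor/Gaussian placeholders chosen purely for illustration; they are not the filter bank used in this work.

```matlab
% Minimal sketch of one scattering path (Eqs. (2)-(3)); psi1, psi2 (band-pass) and
% phi (low-pass) are illustrative placeholder filters, not the paper's filter bank.
fs = 44100;  t = (-256:256)/fs;                       % short filter support
s    = randn(1, 4096);                                % stand-in for a speech frame
psi1 = exp(-(t*800).^2) .* exp(1i*2*pi*4000*t);       % band-pass wavelet near 4 kHz
psi2 = exp(-(t*200).^2) .* exp(1i*2*pi*500*t);        % band-pass wavelet near 500 Hz
phi  = exp(-(t*100).^2);  phi = phi / sum(phi);       % low-pass averaging filter
U1 = abs(conv(s,  psi1, 'same'));   % |s * psi_{lambda1}|   (wavelet modulus)
S1 = conv(U1, phi, 'same');         % first-order coefficients, Eq. (2)
U2 = abs(conv(U1, psi2, 'same'));   % ||s * psi_{lambda1}| * psi_{lambda2}|
S2 = conv(U2, phi, 'same');         % second-order coefficients, Eq. (3)
```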

3.3 Bidirectional long-short term memory (Bi-LSTM)

Recurrent neural networks (RNN) have been shown to be extremely effective at solving sequential problems [32, 33]. Long short-term memory (LSTM) networks were introduced in 1997 by Hochreiter and Schmidhuber [34] as an extension of the RNN. A rich literature exists on LSTMs owing to their ability to learn long-term dependencies of the inputs by retaining information over long periods, and to their competence in dealing with the vanishing/exploding gradient problem [35]. This is made possible by incorporating an input gate (\(\hat{\textsf{g}}_i^{(t)}\)), an output gate (\(\hat{\textsf{g}}_\theta ^{(t)}\)), a forget gate (\(\hat{\textsf{g}}_f^{(t)}\)), and a memory cell (\({C}^{(t)}\)) in the LSTM network [36], as shown in Fig. 3a. The memory cell update relies on two layers: a sigmoid layer, which decides which values are updated, and a tanh layer, which forms the new candidate cell state vector (\({\hat{C}}^{(t)}\)) that is added to the memory cell.

Fig. 3 a Basic details of LSTM cell. b Recurrent neural network with Bi-LSTM layer

Bidirectional LSTM (Bi-LSTM) is an extension of the LSTM model that employs two LSTMs on the input data [37]. The proposed network learns not only from the input cell (\(C^{(t-1)}\)) to the output cell (\(C^{(t)}\)) in the forward direction using Eq. (4), but also from the output cell to the input cell in the backward direction using Eq. (5), via two different hidden layers [38], in order to preserve past and future information, as depicted in Fig. 3b.

$$\begin{aligned}{} & {} \overrightarrow{h}^{(t)}=tanh(U_{\overrightarrow{h}}[\overrightarrow{h}^{(t-1)}, x^{(t)}]+b_{\overrightarrow{h}}) \end{aligned}$$
(4)
$$\begin{aligned}{} & {} \overleftarrow{h}^{(t)}=tanh(U_{\overleftarrow{h}}[\overleftarrow{h}^{(t-1)}, x^{(t)}]+b_{\overleftarrow{h}}) \end{aligned}$$
(5)

Here \(x^{(t)}\) denotes the current input, \(\overrightarrow{h}^{(t)}\) is the forward hidden sequence from \(t= 1\) to T, \(\overleftarrow{h}^{(t)}\) is the backward hidden sequence from \(t= T\) to 1, and \(U_{\overrightarrow{h}}\) and \(U_{\overleftarrow{h}}\) are the corresponding weight matrices with bias vectors \(b_{\overrightarrow{h}}\) and \(b_{\overleftarrow{h}}\).

Finally, the output \(y^{(t)}\) is obtained by combining the two hidden states with an output bias vector \(b_y\), as shown in Eq. (6).

$$\begin{aligned} y^{(t)}=U_{\overrightarrow{h}}\overrightarrow{h}^{(t)}+U_{\overleftarrow{h}}\overleftarrow{h}^{(t)} + b_y \end{aligned}$$
(6)

Implementing the gated cell (LSTM) twice enhances the ability to learn long-term dependencies and, as a result, improves prediction performance. A more detailed description of the Bi-LSTM network configuration is given in Sect. 4.3.
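
As a numeric illustration of Eqs. (4)–(6), the sketch below runs the simplified (tanh-only) bidirectional recurrence written above over a D-by-T feature sequence X. The weights are random placeholders rather than learned parameters, the dimensions are illustrative, and the gating of the full LSTM cell (Fig. 3a) is omitted for brevity.

```matlab
% Simplified bidirectional recurrence of Eqs. (4)-(6) over a D-by-T sequence X;
% weights/biases are random placeholders (a real Bi-LSTM also uses the gates of Fig. 3a).
D = 410; T = 25; H = 16; C = 20;               % illustrative dimensions
X  = randn(D, T);                              % stand-in feature sequence
Uf = randn(H, H + D); bf = zeros(H, 1);        % forward weights/bias, Eq. (4)
Ub = randn(H, H + D); bb = zeros(H, 1);        % backward weights/bias, Eq. (5)
Vf = randn(C, H); Vb = randn(C, H); by = zeros(C, 1);  % output weights/bias, Eq. (6)
hf = zeros(H, T); hb = zeros(H, T);
for t = 1:T                                    % forward pass, t = 1 ... T
    prev = zeros(H, 1); if t > 1, prev = hf(:, t-1); end
    hf(:, t) = tanh(Uf * [prev; X(:, t)] + bf);
end
for t = T:-1:1                                 % backward pass, t = T ... 1
    nxt = zeros(H, 1); if t < T, nxt = hb(:, t+1); end
    hb(:, t) = tanh(Ub * [nxt; X(:, t)] + bb);
end
Y = Vf * hf + Vb * hb + by;                    % per-step outputs y^(t), Eq. (6)
```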

4 Results and discussions

This section describes the experimental settings and the outcomes of developing the child speaker identification model. Initially, the datasets were randomly split into train and test sets for both TD and TI speech. To gauge overall system performance, the average accuracy, precision, and recall metrics were employed and compared with existing work. MATLAB R2022a was used for all investigations of children’s speech, including feature extraction, modeling, and identification.

4.1 Data processing

This article incorporates a closed-set child speaker identification system that integrates TD and TI speech to promote voice biometric applications [39,40,41]. The overall corpus length is 3.34 h (\(\sim\)200 min) and comprises 2898 utterances, of which 2173 are TD speech data and 725 are TI speech data. Furthermore, as indicated in Table 1, each type of speech is randomly divided into 80% training and 20% testing data.
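
A random 80/20 hold-out split of this kind can be obtained in MATLAB, for example, as sketched below. The per-utterance speaker labels here are a toy stand-in, and this is not the splitting code of the original study.

```matlab
% Illustrative 80/20 hold-out split per speech type (cf. Table 1); "labels" is a
% toy stand-in for the per-utterance speaker labels, not the original study's code.
labels   = categorical(randi(20, 2898, 1));      % toy speaker label for every utterance
cv       = cvpartition(labels, 'HoldOut', 0.20); % stratified 80/20 split across speakers
trainIdx = training(cv);                         % logical index of training utterances
testIdx  = test(cv);                             % logical index of test utterances
```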

Table 1 Details of train and test split in non-native children speech corpus

4.2 Feature extraction

Feature extraction is an important process in any recognition system, since it enables the system to reliably identify speakers by transforming the raw data into feature vectors. The most popular feature extraction methods used in automatic speaker recognition systems are the MFCC and LPCC [42]. In MFCCs, filters with non-linearly varying bandwidths are imposed over the STFT. This filtering may be considered a form of frequency-domain pooling that counteracts instability. Unfortunately, this frequency pooling entails a loss of signal information at higher frequencies [43]. To counteract the loss of high-frequency information in MFCCs, the WST additionally computes second-order coefficients of the speech signal.

In the proposed speaker identification models, the WST employs the Gabor (analytic Morlet) wavelet [31] and is generated to a depth of 2. The first filter bank comprises eight wavelets per octave, whereas the second includes one wavelet per octave, represented by the quality factors Q = [8 1]. The wavelets use a low-pass scaling function to produce representations of children’s speech with low variability. The time scale for invariance is set at 500 ms with a batch size of 64. The coefficients are then standardized and log-transformed. As a result, analyzing children’s speech with the first- and second-order filter banks of the WST extends the mel filter bank representation while conserving the information. Finally, the resulting scattering network has 411 paths by 25 scattering time windows for each audio signal.
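
Under the settings just listed, a wavelet scattering extractor can be configured in MATLAB roughly as follows; the fixed excerpt length N (a random stand-in signal is used here) and the use of the log transform in featureMatrix are assumptions about details not spelled out above.

```matlab
% Sketch of the scattering feature extraction with the stated settings; the fixed
% excerpt length N and the log-transform flag are assumptions, and x is a stand-in signal.
fs = 44100;  N = 2^17;                               % illustrative signal length (~3 s)
x  = randn(N, 1);                                    % stand-in for a speech excerpt
sn = waveletScattering('SignalLength', N, ...
                       'SamplingFrequency', fs, ...
                       'InvarianceScale', 0.5, ...   % 500 ms invariance scale
                       'QualityFactors', [8 1]);     % Q = [8 1], depth-2 scattering
feat = featureMatrix(sn, x, 'Transform', 'log');     % (paths) x (time windows) matrix
```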

4.3 Training network configuration

The proposed child speaker identification system was developed and trained using a Bi-LSTM classifier. The training network comprises five layers, with the scattering coefficients extracted by the wavelet scattering network fed into this 5-layer RNN. The input layer is a sequence layer with 410 dimensions, followed by a Bi-LSTM layer with 512 hidden units and a fully connected layer with 20 outputs. The softmax function serves as the activation function in the output layer, predicting a multinomial probability [44] over the multiple speakers. Finally, cross-entropy is employed in the classification layer to update the network by computing the difference between the output and target multinomial probabilities, as shown in Fig. 3b. The proposed model was trained using “Adam (adaptive moment estimation)” [45], an optimization algorithm with an adaptive learning rate that computes individual learning rates for different parameters. The learning options used in both the TD and TI speaker identification systems are given in Table 2.
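
A MATLAB layer stack matching this description is sketched below, trained and applied to toy stand-in data. The mini-batch size and shuffle settings are illustrative placeholders, and the actual learning options follow Table 2.

```matlab
% Sketch of the 5-layer Bi-LSTM classifier described above on toy stand-in data;
% mini-batch size and shuffling are illustrative (actual learning options: Table 2).
XTrain = arrayfun(@(k) randn(410, 25), (1:40)', 'UniformOutput', false); % toy feature sequences
YTrain = categorical(repmat((1:20)', 2, 1));                             % toy labels, 20 speakers
layers = [
    sequenceInputLayer(410)                    % 410-dimensional scattering sequences
    bilstmLayer(512, 'OutputMode', 'last')     % Bi-LSTM layer with 512 hidden units
    fullyConnectedLayer(20)                    % one output per enrolled child speaker
    softmaxLayer                               % multinomial class probabilities
    classificationLayer];                      % cross-entropy loss
opts = trainingOptions('adam', ...             % Adam optimizer (see Table 2)
    'MaxEpochs', 50, ...
    'MiniBatchSize', 20, ...                   % illustrative value
    'Shuffle', 'every-epoch', ...
    'Verbose', false);
net = trainNetwork(XTrain, YTrain, layers, opts);
% Prediction: each test utterance is assigned the class with the highest softmax posterior.
XTest = {randn(410, 25)};                      % toy test utterance
[predLabel, scores] = classify(net, XTest);
```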

Table 2 Training parameters of Adam optimizer
Table 3 Evaluation metrics in multiple class (J) speaker identification system

4.4 Evaluation metrics

The performance measures considered to evaluate the proposed non-native children identification system in the TD and TI tasks are the average accuracy, average precision, average recall, and average F1 score [46], as shown in Table 3, since the system recognizes children in a multi-class setting [47]. Moreover, the area under the receiver operating characteristic curve (AUROC), or ROC index, is used to measure the model’s scoring ability. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold levels, characterizing how effectively the model separates the signal from the noise [48].
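
For reference, the macro-averaged metrics of Table 3 can be computed from a J-class confusion matrix as sketched below (rows taken as true classes, columns as predicted classes). A small toy matrix stands in for the 20-class case, and the standard definitions of these metrics are assumed.

```matlab
% Macro-averaged metrics from a J-class confusion matrix CM (rows: true class,
% columns: predicted class); a toy 3-class matrix stands in for the 20-class case.
CM = [50 2 1; 3 45 2; 0 1 46];                 % toy confusion matrix
TP        = diag(CM);
precision = TP ./ sum(CM, 1)';                 % per-class precision
recall    = TP ./ sum(CM, 2);                  % per-class recall
f1        = 2 * precision .* recall ./ (precision + recall);
accuracy  = sum(TP) / sum(CM(:));              % overall accuracy
avgP = mean(precision); avgR = mean(recall); avgF1 = mean(f1);
```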

Fig. 4 Confusion matrix of text-dependent speaker identification system

4.5 Text-dependent speaker identification (TD-SI)

To investigate the performance of the speaker identification system on the TD speech task of non-native children, the Bi-LSTM network is trained on 1736 utterances and tested with 437 utterances from 11 female and 9 male children. The training comprises 13 iterations per epoch for 50 epochs, for a total of 650 iterations over the training cycle. The TD-SI system achieves an overall average accuracy of 97.94% with a misclassification rate of 2.1%, primarily owing to speaker F6, whose voice is similar to F5 and F2 due to age resemblance. Male children M6 and M9 are often misclassified as female child F6 because male children whose voices have not yet changed with puberty may occasionally closely match female voices [15]. Overall average accuracy may not provide deep insight into system performance because of the imbalance in the dataset; consequently, the F-measure is used as a combined measure (harmonic mean) of precision and recall. In the TD speech task, the F-measure of non-native speaker identification is 98.01%, with a precision of 98.02% and a recall of 98.01%, as displayed in the confusion matrix in Fig. 4. Figure 6a and b show the ROC curves for the female and male classes in the TD-SI task, respectively.

Table 4 Performance measure of speaker identification system in terms of average accuracy (\(\bar{A_v}\)), average precision (\({\bar{P}}\)), average recall (\({\bar{R}}\)), average F-measure (\({\bar{F1}}\)-score)
Fig. 5 Confusion matrix of text-independent speaker identification system

4.6 Text-independent speaker identification (TI-SI)

The reliability of the speaker identification model on TI speech of non-native children is shown in Table 4. The Bi-LSTM network is trained on 580 utterances of children aged 7–12 years and evaluated against 145 imbalanced utterances from 20 speakers. The training involves 4 iterations per epoch for 50 epochs, resulting in a maximum of 200 iterations over the training cycle. The TI-SI system achieves an overall average accuracy of 96.6% with an error rate of 3.4%. Two of F10’s utterances are incorrectly identified as F8, one of F4’s utterances is incorrectly predicted as M5, and two of F9’s utterances are wrongly predicted as F6 and F8, respectively. These misclassifications largely involve female children aged 7 and 8, because the females in this age group have comparable pitch. As the TI-SI model includes limited test data, misclassifying even one utterance results in a considerable loss, as illustrated in the confusion matrix in Fig. 5. The ROC curves for the female and male classes in the TI-SI task are depicted in Fig. 6c and d, respectively. Finally, the F-measure is also employed as a consolidated measure of precision and recall. In the TI speech task, the F-measure of non-native speaker identification is 97.1%, with a precision of 97.32% and a recall of 96.94%.

Fig. 6 a, b ROC curves for text-dependent speaker identification. c, d ROC curves for text-independent speaker identification

4.7 Comparative analysis of proposed system with earlier work

This section provides a more detailed comparison with current state-of-the-art models for identifying child speakers, including information on the speech corpus type, feature extraction method, modeling strategy, and accuracy results. Additionally, adult speaker identification systems using wavelet scattered features are also compared. As depicted in Table 5, the proposed model outperforms earlier approaches on children’s speech in both TD and TI speech tasks.

Table 5 Comparative analysis of earlier state-of-the-art results in speaker identification

5 Conclusion and future work

In this study, non-native child speaker identification systems were built using WSTs as feature vectors, which acquire speaker-discriminatory information directly from the raw speech signals. A corpus of non-native children learning English as a second language was collected and examined in natural environmental settings in order to benefit children’s speech technology. The proposed system proved to work reasonably well for both TD and TI speech scenarios in identifying 20 unique children. The experimental findings showed that the TD-SI and TI-SI models performed almost equally well, with overall accuracies of 97.9% and 96.6%, respectively, using scattering features trained on Bi-LSTM RNNs. The outcome of the study shows that the proposed closed-set speaker identification method is reliable and efficient. These findings indicate that the system may be deployed in a variety of tasks, including access control and security. Future developments of this study will include the automatic identification of speaker, nativity, age, and gender for a large corpus of non-native children’s speech.