1 Introduction

Speaker recognition, a biometric method, uses speech features to verify a user's identity through automated analysis of voice signals. Over recent decades, Automatic Speaker Recognition (ASR) systems have advanced significantly, finding applications in forensics, banking, and security. These systems comprise preprocessing, feature extraction, and speaker modeling components. Preprocessing refines the input signal by eliminating non-speech elements and performing tasks such as pre-emphasis and endpoint detection [1, 2]. Feature extraction, often termed "front-end processing," transforms voice signals into numerical characteristics essential for training and testing speaker recognition systems. Speaker modeling constructs methods for speaker feature matching, which are crucial in the recognition stage for identification or verification. Thus, speaker recognition systems serve vital roles across various domains, ensuring efficient and secure user authentication [3] (Fig. 1).

Fig. 1 A basic framework for an automated speaker recognition system

Speaker recognition systems often struggle in challenging acoustic environments due to factors such as low signal-to-noise ratio (SNR), diverse accents, and ambient noise, for example babble noise in crowded places. Conventional methods rely heavily on short-term spectral features such as MFCC and Linear Prediction Cepstral Coefficients (LPCC), which limits their effectiveness in the presence of acoustic degradations. To address this, our research proposes a deep learning-based method called 1D-Frame Level-Feature Fusion-CNN. By combining MFCC with normalized pitch and phase features, this approach enhances recognition capability even in scenarios with varying background noise strengths [4]. This research aligns with existing literature and offers promising advancements in speaker identification and verification techniques.

2 Overview of previous work

2.1 Earlier approach

Over the last decade, speaker recognition has undergone significant advancements, notably leveraging cepstral characteristics like MFCC [5]. Statistical and machine-learning methods such as the Gaussian Mixture Model [6], the Support Vector Machine (SVM) [7], and various score normalization techniques have been instrumental in speaker recognition systems. Recent improvements include the adoption of Gaussian Mixture Model-Universal Background Model (GMM-UBM) approaches [6], Support Vector (SV) techniques [8], and the Factor Analysis-based identity vector (i-vector) architecture [9]. However, technical challenges persist in the domain, with environmental background noise and its associated variations posing significant hurdles, particularly in scenarios with low signal-to-noise ratio.

The complex process of human speech involves various organs, yielding features indicative of pronunciation qualities in voice signals [10]. Speaker recognition algorithms integrate multiple speech characteristics to enhance accuracy [11]. Common feature extraction methods include LPCC, MFCC, Perceptual Linear Predictive Analysis, cepstrum differential coefficients, and RASTA filters [5, 12]. Spectrograms, on the other hand, offer a concise representation of acoustic features [13].

2.2 Deep learning approach

Recent advancements in speaker recognition, particularly with the adoption of deep learning, have significantly improved recognition rates and robustness [14]. MFCC, known for its resistance to noise and session variations, remains a cornerstone in this field [15]. Strategies for identifying similar MFCC feature vectors have been proposed [16], and CNN architectures have shown promise in enhancing accuracy [17]. Combining learned features with MFCC characteristics has yielded improved performance [18]. However, the computational demands of deep learning models remain a challenge [19], prompting the exploration of noise reduction techniques for robust speaker authentication.

Deep Neural Networks (DNNs) have demonstrated greater resilience to noise and acoustic reverberation compared to i-vectors, a machine learning approach incorporating a GMM-UBM front-end with Probabilistic Linear Discriminant Analysis (PLDA) as the back-end classifier [20]. The benefits of voice-enhancing strategies with DNN embeddings in speaker recognition were investigated in [21]. El-Moneim et al. [22] focused on text-independent speaker recognition in noisy and reverberant environments, employing MFCCs, spectrum, and log-spectrum features analyzed by Long Short-Term Memory (LSTM)-Recurrent Neural Network (RNN) classifiers. Hourri et al. [23] proposed a novel method using CNN filters to extract speaker features, resulting in convVectors, which demonstrated enhanced performance under noise conditions.

A two-level noise-robust PNN model (2LNR-PNN), addressing noise during the preprocessing and feature extraction stages using spectral subtraction and GMM, was introduced in [24], resulting in improved performance, reliability, and resilience in noisy and real-time scenarios. Hamidi et al. [25] utilized a Hidden Markov Model (HMM) based automatic speech recognition system to analyze cough signals, enabling the classification of coughs into sick or healthy speaker categories. AL-Shakarchy et al. [26] described a model designed to authenticate individuals based on their distinctive voice characteristics using deep learning techniques. Radha and Bansal [27] developed a child speaker identification system for non-native English speakers, evaluating the impact of fluency in text-dependent and text-independent tasks. Chelali [28] focused on audiovisual data fusion for robust speaker recognition in noisy environments, extracting low-level features (LPC and MFCC for the acoustic modality; ZM and HOG for the visual modality) and fusing them to enhance modality efficiency.

3 Proposed methodology

Recognizing speakers in noisy acoustic environments poses a challenge because crucial acoustic cues are disrupted. This research aims to bolster system resilience and accurately identify the desired speakers by jointly refining noise suppression techniques and speaker identification-verification procedures. This involves aligning the learned features, or the enhanced speech signals, with the information needed for speaker identification-verification.

Speaker recognition involves identifying and confirming individuals based on their voice characteristics. This process facilitates tasks like personalized speech adaptation and speaker authentication for security purposes. Central to this process is feature extraction, which precisely characterizes speech signals amidst variations. By employing methods like Normalized Pitch information, MFCC, and Phase information, the feature extraction process translates acoustic input signals into patterns of acoustic feature vectors, providing a comprehensive depiction of speech signals. Deep Neural Networks are subsequently utilized to categorize speakers as either target or non-target based on these extracted features.

The proposed feature extraction process extracts cepstral features, Normalized Pitch Frequency, and phase information from the speech signal. Cepstral analysis, using Mel filter banks, decomposes speech signal frames into logarithmic spectral-domain coefficients that approximate the frequency selectivity of the human ear. Incorporating pitch frequency [29] together with MFCCs [30] and phase information [31] aims to improve recognition outcomes. The classification process includes training and testing phases. During training, features from the enhanced speech signal are used to train the Deep Neural Network model for each speaker. In testing, an unknown speaker's model is compared with the learned features to make the identification-verification decision. Speech signals from the ITU-T P-series recommendations directory [32] are used, with various real-world noise signals introduced before recognition.
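To make the front end concrete, the following is a minimal Python sketch of how the three feature streams could be extracted; the paper does not name its toolchain, so the use of librosa, the pYIN pitch tracker, the 60–400 Hz pitch search range, and the cepstral treatment of the STFT phase are all assumptions.

```python
import numpy as np
import librosa

def extract_features(y, sr=8000, n_mfcc=20):
    """Sketch of the three streams: MFCC(+deltas), normalized pitch, phase cepstrum."""
    # 20 mel-cepstral coefficients plus 20 first-order deltas -> 40-dim MFCC frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)      # (20, T)
    mfcc = np.vstack([mfcc, librosa.feature.delta(mfcc)])       # (40, T)

    # Pitch track (pYIN is an assumption), normalized over the utterance
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = np.nan_to_num(f0)                                      # unvoiced frames -> 0
    pitch = f0 / (f0.max() + 1e-8)

    # One plausible reading of "phase cepstral coefficients": a cepstral
    # transform of the unwrapped STFT phase, keeping 40 coefficients
    phase = np.unwrap(np.angle(librosa.stft(y)), axis=0)
    phase_cep = np.real(np.fft.ifft(phase, axis=0))[:40, :]

    # Truncate all streams to a common number of frames
    T = min(mfcc.shape[1], len(pitch), phase_cep.shape[1])
    return mfcc[:, :T], pitch[:T], phase_cep[:, :T]
```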

3.1 Convolutional neural network processing

The availability of large quantities of training data has been a primary driver of recent advances in deep learning. However, such data are rarely available for specific tasks like speaker recognition, where large amounts of speech cannot be collected in realistic conditions. As a result, in this work we propose recognizing speakers using only a few training sets. To accomplish this, we employ a deep neural network with the Mel cepstral coefficients, normalized pitch spectrum, and phase cepstral coefficients as input, as depicted in Fig. 2.

Fig. 2 Schematic overview of the proposed approach

Our strategy for speaker recognition employs a CNN designed primarily to learn speaker-dependent attributes from fragments of Mel features, normalized pitch features, and phase cepstral features of clean speech, corrupted speech, and enhanced speech. We developed a CNN-based feature-level fusion method for combining and projecting speech attributes from the MFCC, Normalized Pitch, and Phase feature spaces into a d-dimensional joint feature space (explained in a later section). The value of d is determined by the CNN architecture. The joint feature space is learned so that the joint representation encompasses highly discriminative, speaker-dependent speech attributes, thereby enhancing speaker recognition accuracy. Before the feature extraction phase, we employ the speech enhancement method of [33] to suppress the impact of noise encountered in real-world scenarios. We deal with three cases, namely clean speech, noise-corrupted speech, and the enhanced speech obtained with [33], for the speaker identification-verification tasks. We concentrate on text-independent speaker recognition throughout this work because it is the more general form and has significant usage in a wide range of applications.

4 Analysis of proposed method

The described procedures extract 40-dimensional Mel, normalized pitch, and normalized phase cepstral feature frames from speech frames. Each feature frame comprises 20 mel-cepstral coefficients (including the zeroth-order coefficient), 20 first-order delta coefficients, 40 phase cepstral coefficients, and a normalized pitch value. Cepstral Mean and Variance Normalization (CMVN) is applied for feature normalization, enhancing generalizability in the experiments. The number of speech frames obtained from a single voice file depends on the sampling frequency and the voice length. For training the CNN with fixed-dimensionality input, 200 consecutive feature frames, termed “feature patches,” are randomly sampled from each voice signal in every batch, resulting in feature patches of size 40 × 200. The MFCC, Normalized Pitch, and NPCC patches are stacked along the channel dimension to form a 40 × 200 × 3, three-channel feature patch named MFCC-NP-NPCC, with the channels corresponding to the MFCC, NP, and NPCC patches, respectively. These features are integrated using 1D convolution filters in the CNN architecture, as illustrated in Fig. 3.
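A minimal NumPy sketch of the patch construction described above is given below, assuming the three streams have already been arranged as frame-aligned 40 × T matrices (how the scalar pitch per frame is expanded to 40 dimensions is not specified in the text) and that CMVN statistics are computed per utterance.

```python
import numpy as np

def cmvn(feats, eps=1e-8):
    """Cepstral mean and variance normalization along time (per coefficient)."""
    return (feats - feats.mean(axis=1, keepdims=True)) / (feats.std(axis=1, keepdims=True) + eps)

def sample_patch(mfcc, np_feat, npcc, patch_len=200, rng=np.random):
    """Randomly sample 200 consecutive frames and stack the three streams as channels."""
    T = mfcc.shape[1]
    start = rng.randint(0, T - patch_len + 1)       # requires T >= 200 frames
    sl = slice(start, start + patch_len)
    # Each stream is 40 x 200; stacking gives a 40 x 200 x 3 MFCC-NP-NPCC patch
    return np.stack([cmvn(mfcc)[:, sl], cmvn(np_feat)[:, sl], cmvn(npcc)[:, sl]], axis=-1)
```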

Fig. 3 A schematic of the proposed 1D-frame level-feature fusion-CNN architecture’s feature fusion

The CNN's objective is to transform each MFCC-NP-NPCC feature frame, with its 3-channel, 40-dimensional representation, into a 128-channel, 1-dimensional frame-level feature embedding. This 128-dimensional Joint Feature Space encapsulates speaker-dependent information linked to the input features. The arrangement of convolutional layers in a CNN significantly impacts its learning capability and effectiveness, as each layer learns distinct concepts from the data and refines information for deeper layers. ReLU non-linearity is applied to filter observations from each convolutional layer, mitigating the vanishing gradient problem commonly encountered with sigmoid activation functions. Additionally, max-pooling is employed to reduce the dimensionality of the network's learned space. Dropout layers are incorporated into the CNN during training to introduce regularization, offering the dual benefit of enhancing the CNN’s resilience to input data variations while mitigating overfitting issues with the training data.
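The exact layer configuration is not specified here, so the following PyTorch sketch should be read as one plausible instantiation of the described design, namely 1D convolutions over each 3-channel, 40-dimensional feature frame with ReLU, max-pooling, and dropout, ending in a 128-dimensional frame-level embedding; the filter counts, kernel sizes, and dropout rate are assumptions.

```python
import torch
import torch.nn as nn

class FrameFusionCNN(nn.Module):
    """Maps a 3-channel, 40-dim feature frame to a 128-dim frame-level embedding."""
    def __init__(self, n_speakers, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(3, 32, kernel_size=3, padding=1),   # fuse MFCC/NP/NPCC channels
            nn.ReLU(),
            nn.MaxPool1d(2),                               # 40 -> 20
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),                               # 20 -> 10
            nn.Dropout(0.3),                               # regularization during training
            nn.Conv1d(64, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                       # 128 x 1 frame-level embedding
        )
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, frames):                             # frames: (batch, 3, 40)
        emb = self.encoder(frames).squeeze(-1)             # (batch, 128) joint feature space
        return self.classifier(emb), emb                   # speaker scores and embedding
```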

4.1 Speaker identification procedure

As shown in Fig. 3, during the testing phase, the input MFCC-NPF-NPCC feature strip \(X\) is divided into MFCC-NPF-NPCC patches \(x_i\), \(i \in \{1,2,\dots,N\}\), where \(N\) is the number of patches. For each input MFCC-NPF-NPCC patch \(x_i\), the CNN returns a set of classification scores \(\{s_{i,j}\}\), \(j \in \{1,2,\dots,S\}\), over the \(S\) speakers, where \(s_{i,j}\) denotes the score assigned to the \(j\)th speaker for the \(i\)th patch. The combined classification scores \(\{S_j\}\) for the complete speech signal are obtained by summing the results over all patches extracted from the speech signal, as given by Eq. (1):

$$S_j=\sum_{i=1}^{N} s_{i,j}, \quad \forall j$$
(1)

The speaker \(j^{*}\) assigned to the input speech signal is then selected according to Eq. (2):

$$j^{*}=\underset{j}{\arg\max}\,\{S_j\}$$
(2)
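In code, Eqs. (1) and (2) reduce to summing the per-patch scores and taking the arg max; a minimal NumPy sketch is shown below (the per-patch scores themselves are assumed to come from the CNN classifier described above).

```python
import numpy as np

def identify_speaker(patch_scores):
    """patch_scores: (N, S) array of CNN scores for N patches over S speakers."""
    S_j = patch_scores.sum(axis=0)        # Eq. (1): accumulate scores over patches
    return int(np.argmax(S_j)), S_j       # Eq. (2): pick the highest-scoring speaker
```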

4.2 Speaker verification procedure

For the verification of the intended speaker, the Cosine Triplet Embedding Loss function is employed. In our scenario, we use the cosine similarity criterion, which offers better learning dynamics than the Euclidean criterion and is consistent with the research in [32]. The cosine triplet embedding loss used to train the model is given by Eq. (3):

$$l\left(S_{\hat{c}_1},S_{\acute{n}_1},S_{\hat{c}_2}\right)=\sum_{\hat{c}_1,\acute{n}_1,\hat{c}_2}^{N}\mathrm{cosine}\left(f\left(\hat{c}_1,\acute{n}_1\right)\right)-\mathrm{cosine}\left(f\left(\hat{c}_1,\hat{c}_2\right)\right)$$
(3)

Here, \(l(\cdot)\) represents the Cosine Triplet Embedding Loss function, \(\hat{c}_1\) denotes a clean speech utterance of speaker 1, \(\acute{n}_1\) the corresponding noise-corrupted utterance of speaker 1, and \(\hat{c}_2\) a clean speech utterance of speaker 2.
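A minimal PyTorch sketch of a cosine triplet embedding loss consistent with Eq. (3) is shown below, with the clean utterance of speaker 1 as the anchor, its noise-corrupted version as the positive, and the clean utterance of speaker 2 as the negative; the margin and the clamping at zero are assumptions not stated in the text.

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(emb_c1, emb_n1, emb_c2, margin=0.2):
    """emb_c1: clean spk-1, emb_n1: noisy spk-1, emb_c2: clean spk-2 embeddings, each (batch, d)."""
    pos = F.cosine_similarity(emb_c1, emb_n1, dim=1)   # pull clean/noisy spk-1 together
    neg = F.cosine_similarity(emb_c1, emb_c2, dim=1)   # push spk-1 away from spk-2
    return torch.clamp(neg - pos + margin, min=0).mean()
```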

An essential criterion considered here is that, although the speech signal changes continuously, the speaker-dependent vocal attributes are presumed to be quasi-stationary over brief periods of time (15–35 ms) [34]. As a result, as mentioned in the feature extraction phase, we operate on short-term voice segments known as “voice frames”. The MFCC, NPF, or NPCC features corresponding to a voice frame are referred to as a “feature frame”. A feature frame derived from a voice frame therefore reflects only that voice frame's characteristics and has no correlation with its adjacent frames from the perspective of speaker recognition. Considering these constraints, we adopted 1D convolutional filters operating on the feature frames in our Convolutional Neural Network to learn speaker-specific attributes of the speech.
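As a concrete illustration of this short-term framing, a 25 ms window with a 10 ms shift (values chosen from within the 15–35 ms range cited above; the exact settings used in the paper are not stated) could be realized as:

```python
import numpy as np

def frame_signal(y, sr=8000, win_ms=25, hop_ms=10):
    """Split a signal into overlapping, Hamming-windowed short-term voice frames."""
    win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    assert len(y) >= win, "signal shorter than one frame"
    n_frames = 1 + (len(y) - win) // hop
    frames = np.stack([y[i * hop: i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)                    # (n_frames, win) windowed frames
```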

5 Experimental approach

5.1 Experimental setup

Our suggested approach was evaluated using the ITU-T speech dataset [32] for clean speech signals. The collection contains 16 recorded sentences in each of 20 languages, and every set (or subset) contains equal numbers of male and female speaker recordings. The speech samples, originally at a 16 kHz sampling rate, were downsampled to 8 kHz to reduce the computational load of the system. The noise signals were chosen from the NOISEX-92 dataset [35], including factory noise. Each of the resulting datasets was produced at one of four SNR levels: 0 dB, 5 dB, 10 dB, or 15 dB. For enhancing the speech signal, the method employed in [33] was incorporated as a preprocessing step to deal with noise-corrupted speech samples from the speakers. Thus, three variants of speech signals, namely clean speech samples, noise-corrupted speech samples, and enhanced speech samples, were provided as input to the proposed speaker recognition system. Because the text uttered by the speakers in the training and testing sets differs, the speaker recognition experiments are text-independent. Table 1 presents the performance evaluation of the proposed approach against state-of-the-art techniques.
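The noisy material can be generated by scaling a noise segment to a target SNR before adding it to the clean signal; the exact mixing procedure used with NOISEX-92 is not described, so the following NumPy sketch is an assumption:

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix a noise segment into clean speech at the requested SNR (in dB)."""
    noise = np.resize(noise, clean.shape)                  # loop/trim noise to the speech length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# e.g. noisy = add_noise_at_snr(clean, factory_noise, snr_db=5)
```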

Table 1 Performance evaluation of the proposed approach using state-of-the-art techniques

5.2 Results and discussion

In this section, we evaluate the text-independent speaker recognition results obtained with MFCC, NPF, and NPCC features. Figures 4 and 5 and Table 2 display the results of the proposed method for recognizing speakers in terms of Identification Accuracy (ID in %), Equal Error Rate (EER in %), False Acceptance Rate (FAR), and False Rejection Rate (FRR) for the factory noise scenario.
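For reference, the EER, FAR, and FRR reported below can be computed from verification trial scores as sketched here using scikit-learn's ROC utilities; this is a generic evaluation recipe rather than the authors' scoring script:

```python
import numpy as np
from sklearn.metrics import roc_curve

def verification_metrics(scores, labels, threshold):
    """scores: similarity scores; labels: 1 = target trial, 0 = impostor trial."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    far, tpr, _ = roc_curve(labels, scores)                # FAR = false positive rate
    frr = 1.0 - tpr                                        # FRR = miss rate
    eer_idx = np.nanargmin(np.abs(far - frr))
    eer = (far[eer_idx] + frr[eer_idx]) / 2.0              # equal error rate
    decisions = scores >= threshold                        # FAR/FRR at a fixed threshold
    far_at_t = np.mean(decisions[labels == 0])
    frr_at_t = np.mean(~decisions[labels == 1])
    return eer, far_at_t, frr_at_t
```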

Fig. 4 Speaker identification results (in %) for noise-corrupted speech and enhanced speech (factory noise)

Fig. 5 Speaker verification results (%EER) with MFCC, NPF, and PCC as the input features (factory noise)

Table 2 False acceptance rate (FAR) and false rejection rate (FRR) results (factory noise)

5.2.1 Speaker identification results

Figure 4 depicts the identification accuracy results of the proposed approach in comparison with the state-of-the-art techniques for factory noise, for noise-corrupted speech and enhanced speech.

Even at 0 dB SNR for factory noise, the proposed method outperforms the other baselines, achieving 40.5% accuracy at 0 dB, 41.6% at 5 dB, 42.8% at 10 dB, and 44.9% at 15 dB. Furthermore, by incorporating the enhancement strategy, the proposed method improves identification performance even further, reaching 93.8% identification accuracy at 0 dB, 94.7% at 5 dB, 95.4% at 10 dB, and 96.1% at 15 dB. Before speaker identification, speech noise suppression is applied and jointly optimized, which filters out some of the noise disruptions. Speaker-dependent speech enhancement is implemented as well. Unlike speaker-independent noise elimination, the incorporation of speaker knowledge not only recovers part of the noise-corrupted speech signal but also reveals speaker-specific characteristics that are important for speaker recognition.

5.2.2 Speaker verification results

Figure 5 depicts the speaker verification accuracy results of the proposed approach in comparison with the state-of-the-art techniques for factory noise, for noise-corrupted speech and enhanced speech.

The results of speaker verification are presented in terms of the Equal Error Rate (EER). The noise-corrupted and enhanced utterances are evaluated under four SNR conditions. The proposed approach clearly benefits from speech noise suppression in all situations; the advantage is greatest at the lower SNR levels, where the EER scores for noise-corrupted utterances are relatively high. At an input SNR of 0 dB, the EER for noise-corrupted speech is 16.11%, compared with 14.21% at 5 dB, 10.39% at 10 dB, and 7.18% at 15 dB. With the proposed method applied to enhanced speech utterances, these scores drop to 7.13% at 0 dB, 5.47% at 5 dB, 3.98% at 10 dB, and 2.25% at 15 dB, respectively. The proposed method outperforms prior speaker verification systems in all conditions. Consequently, using MFCC in conjunction with NPF and NPCC improves speaker verification performance consistently over existing methods at all input SNR levels.

Table 2 displays the performance assessment for noise-corrupted speech and enhanced speech under the influence of factory noise in terms of False Acceptance Rate (FAR) and False Rejection Rate (FRR). For factory noise, the presented approach achieved promising results of 0.35942 FAR and 0.11096 FRR at 15 dB SNR for noise-corrupted speech. For the enhanced speech utterances, the FAR achieved is 0.43205 and the FRR is 0.11217 at the 15 dB SNR condition. Under all SNR conditions and noise variants, the proposed approach outperforms the existing techniques.

6 Conclusion and future scope

Noise in speech data frequently misrepresents the speaker-dependent features present, complicating speaker identification and verification. Because MFCC features are not very resistant to audio degradation, speaker recognition methods that depend exclusively on MFCC attributes struggle in the presence of noise encountered in real-world scenarios. Conversely, as demonstrated by the experimental observations, the proposed CNN classifier with MFCC, NPF, and NPCC input features is robust to a wide spectrum of audio degradations. In terms of identification accuracy, equal error rate, false acceptance rate, and false rejection rate, the proposed technique outperformed all baseline procedures by a significant margin.

The future of voice enhancement and speaker recognition is defined by the incorporation of sophisticated signal processing technologies, such as deep learning architectures, as well as the investigation of multimodal approaches that combine auditory and visual information for increased accuracy. Adaptive systems capable of dynamically responding to ambient elements and user context are planned, coupled with attempts to improve resilience against numerous sources of variability such as accents, noise, and channel distortions. As these technologies become more widely used, there will be a greater emphasis on privacy and security concerns. Real-time applications in a variety of fields, including healthcare, automotive, security, and customer service, will push the development of efficient algorithms and hardware implementations, allowing for seamless integration into common devices and systems.